(54) CHARACTER ARRAY RETRIEVING METHOD 
(57)Abstract: 

PURPOSE: To provide a character array retrieving method that obtains, in a short time, a
retrieval result in which errors are permitted.

CONSTITUTION: A character array registration program 201, which registers a character array,
and a character array component table generation and registration program 202, which generates
a character array component table by gathering, without duplication, the character components
of constant length used in the character array 103, are executed. Then an error-permissible
character array component table search program 203, which extracts only those character arrays
that contain more than a certain number of the character array components of the retrieval
character array, the number being based on the permissible rate of retrieval errors, and an
error-permissible character array search program 204, which searches the extracted character
arrays to retrieve those satisfying the permissible rate of retrieval errors, are executed in
order under the control of a hierarchical retrieval control program 206, and a retrieval result
is output. Character arrays that exceed the permissible error rate are thus discarded before
the character arrays themselves are referred to, and the character arrays within the
permissible error rate are retrieved from among the retrieved character
* NOTICES * 

Japan Patent Office is not responsible for any 
damages caused by the use of this translation. 

1. This document has been translated by computer, so the translation may not reflect the
original precisely.

2. **** shows a word which could not be translated.
3. In the drawings, no words are translated.



CLAIMS 



[Claim(s)] 

[Claim 1] A character-array search method for searching a character-array database, in which
two or more character arrays are registered, for a specified retrieval character array, the
method characterized by having: (1) a step of creating a character-array component table which
contains, without duplication, the contiguous partial character arrays of a predetermined
length (referred to as k) contained in said registration character arrays, and which expresses
information about these partial character arrays; (2) a step of registering said registration
character arrays together with said character-array component table in the character-array
database; (3) a step of extracting, by a predetermined approach, a subset of the character
arrays of said predetermined length (k) contained in said retrieval character array; (4) a
character-array component table search step of referring to said character-array component
table and extracting the registration character arrays that contain more of the character
arrays in said subset than a fixed number defined by a predetermined rate of error allowance,
as candidates for character arrays within said rate of error allowance; and (5) a
character-array search step of referring to the registration character arrays obtained by the
character-array component table search step and extracting those registration character arrays
that are within said rate of error allowance.

[Claim 2] The character-array search method according to claim 1, characterized in that said
character-array component table is given, for every combination of said partial character array
and said registration character array, a bit information storing field which stores 1-bit
information showing whether or not said partial character array is contained in said
registration character array, and in that the step of searching said character-array component
table refers to said bit information storing field for each registration character array with
respect to said subset.
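
As an illustration only, the bit information storing field of claim 2 could be sketched in
Python as below. The function names, the use of one integer per partial character array as the
bit field, and the sample sequences are assumptions made for this sketch; the source does not
specify a storage layout.

```python
def kmers(seq, k):
    """Return the distinct length-k substrings of seq (no duplication)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_bit_table(registered, k):
    """Map each k-mer to an int whose bit j is 1 iff registered[j] contains it."""
    table = {}
    for j, seq in enumerate(registered):
        for km in kmers(seq, k):
            table[km] = table.get(km, 0) | (1 << j)
    return table


def count_hits(table, subset, n_registered):
    """For each registered array, count how many subset entries it contains."""
    counts = [0] * n_registered
    for km in subset:
        bits = table.get(km, 0)  # a k-mer absent from the table hits nothing
        for j in range(n_registered):
            counts[j] += (bits >> j) & 1
    return counts


registered = ["ACGTAC", "TTTTGG"]  # hypothetical registered arrays
table = build_bit_table(registered, 3)
hits = count_hits(table, ["ACG", "TAC", "GGG"], len(registered))
```

Here `hits[j]` is the number of subset entries found in the j-th registered array, obtained
without reading the registered arrays themselves.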

[Claim 3] The character-array search method according to claim 1 or 2, characterized in that
said character-array database includes, for each of said registration character arrays, a
plurality of partial character arrays which overlap one another by at least a predetermined
length and which together cover the overall length of that registration character array.

[Claim 4] The character-array search method according to claim 2 or 3, characterized in that,
in the step of extracting by the predetermined approach the subset of the character arrays of
said predetermined length (k) contained in said retrieval character array, the subset is
extracted by repeating, as many times as possible, the operation of shifting by a number of
characters (referred to as ks) at a time from the character at one end of said retrieval
character array and obtaining, in sequence, a character array of said predetermined length (k).

[Claim 5] The character-array search method according to claim 4, characterized in that said
predetermined length (k) and said number of characters (ks) are equal.
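
The extraction of claims 4 and 5 can be pictured with the short Python sketch below. The
function name `extract_subset` and the sample query are hypothetical; the sketch simply takes a
length-k window at offsets 0, ks, 2*ks, ... for as long as a full window still fits.

```python
def extract_subset(query, k, ks):
    """Take length-k windows starting every ks characters from one end."""
    return [query[i:i + k] for i in range(0, len(query) - k + 1, ks)]


query = "ACGTACGTAC"                        # hypothetical retrieval character array
non_overlapping = extract_subset(query, 4, 4)  # ks == k, as in claim 5
overlapping = extract_subset(query, 4, 2)      # ks < k gives overlapping windows
```

With ks equal to k the windows do not overlap, so a single character error in the query can
spoil at most one window.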

[Claim 6] A character-array search method for searching a character-array database, in which
two or more character arrays are registered, for a specified retrieval character array, the
method characterized by having: (1) a step of assuming, for each registration character array,
an annular registration character array in which both ends of said registration character array
are connected, and creating a character-array component table which contains, without
duplication, the partial character arrays of a predetermined length (referred to as k)
contained in said annular registration character array; (2) a step of registering said
registration character arrays together with said character-array component table in the
character-array database; (3) a step of assuming an annular retrieval character array in which
both ends of said retrieval character array are connected, and extracting, by a predetermined
approach, a subset of the character arrays of said predetermined length (k) contained in said
annular retrieval character array; (4) a character-array component table search step of
referring to said character-array component table and extracting the registration character
arrays that contain more of the character arrays in said subset than a fixed number defined by
a predetermined rate of error allowance, as candidates for character arrays within said rate of
error allowance; and (5) a character-array search step of referring to the registration
character arrays obtained by the character-array component table search step and extracting
those registration character arrays that are within said rate of error allowance.

[Claim 7] The character-array search method according to claim 6, characterized in that, in
said character-array component table, each of said partial character arrays is given, for each
annular registration character array, a bit information storing field which stores 1-bit
information showing whether or not that partial character array is contained in that annular
registration character array, and in that the step of searching said character-array component
table refers to said bit information storing field for each registration character array with
respect to said subset.

[Claim 8] The character-array search method according to claim 7, characterized in that, in
the step of extracting by the predetermined approach the subset of the character arrays of said
predetermined length (k) contained in said annular retrieval character array, the subset is
extracted by repeating the operation of shifting by a number of characters (referred to as ks)
at a time from the character at one end of said retrieval character array and obtaining, in
sequence, a character array of said predetermined length (k), until the obtained character
array no longer contains the part at which both ends of said annular retrieval character array
are connected.
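
A minimal Python sketch of the annular (ring) treatment in claims 6 to 8 follows. It assumes
that each array is at least k characters long and reads the stopping rule of claim 8 as "start
a window every ks characters until the start position would pass the end of the original
array"; both the function names and that reading are assumptions of this sketch.

```python
def circular_kmers(seq, k):
    """Distinct length-k substrings of seq read as a ring (both ends joined)."""
    ring = seq + seq[:k - 1]  # append the head so windows can wrap around
    return {ring[i:i + k] for i in range(len(seq))}


def extract_circular_subset(query, k, ks):
    """Windows at offsets 0, ks, 2*ks, ... including those that wrap the join."""
    ring = query + query[:k - 1]
    return [ring[i:i + k] for i in range(0, len(query), ks)]


ring_table = circular_kmers("ACGT", 3)
ring_subset = extract_circular_subset("ACGTACGTAC", 4, 4)
```

Compared with the linear case, the last window ("ACAC" here) spans the junction, so the tail of
the query also contributes to the table search.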

[Claim 9] The character-array search method according to claim 8, characterized in that said
predetermined length (k) and said number of characters (ks) are equal.

[Claim 10] The character-array search method according to any one of claims 2 to 5 or claims 7
to 9, characterized in that the number of bit information storing fields of said
character-array component table is made smaller than the number of kinds of character arrays
that can be formed by combination, by mapping the code of each character array, by means of a
hash function, to an entry code having fewer entries than the number of kinds of character
arrays that can be formed by combination.
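
The table reduction of claim 10 might look like the following Python sketch. The base-4
encoding of DNA characters and the modulo hash are assumptions chosen for illustration; the
source does not name a particular hash function. Because several partial character arrays can
share one entry, a lookup may report a partial array that is not actually present, but it never
misses one that is present.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_code(kmer):
    """Encode a DNA k-mer as a base-4 integer."""
    code = 0
    for ch in kmer:
        code = code * 4 + BASE_CODE[ch]
    return code


def slot(kmer, n_slots):
    """Hypothetical hash: fold the k-mer code into n_slots table entries."""
    return kmer_code(kmer) % n_slots


def build_hashed_table(registered, k, n_slots):
    """table[s] has bit j set iff registered[j] holds a k-mer hashing to s."""
    table = [0] * n_slots
    for j, seq in enumerate(registered):
        for i in range(len(seq) - k + 1):
            table[slot(seq[i:i + k], n_slots)] |= 1 << j
    return table


def may_contain(table, kmer, j):
    """True if registered[j] may contain kmer (collisions give false positives)."""
    return (table[slot(kmer, len(table))] >> j) & 1 == 1


hashed = build_hashed_table(["ACGTAC"], 3, 7)  # 7 entries instead of 4**3 = 64
```

In this example "TTT" is not in the registered array, yet it shares an entry with "TAC" and is
reported as possibly present; such candidates are removed later by the character-array search
itself.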

[Claim 11] The character-array search method according to claim 1, characterized in that said
character array expresses a base sequence of DNA or RNA.

[Claim 12] The character-array search method according to claim 1, characterized in that said
character array expresses an amino acid sequence.

[Claim 13] The character-array search method according to claim 1, characterized in that said
character array contains symbol characters and pictorial symbols.

[Claim 14] The character-array search method according to claim 6, characterized in that said
character array expresses a base sequence of DNA or RNA.

[Claim 15] The character-array search method according to claim 6, characterized in that said
character array expresses an amino acid sequence.

[Claim 16] The character-array search method according to claim 6, characterized in that said
character array contains symbol characters and pictorial symbols.

[Claim 17] A character-array search method for searching a character-array database, in which
two or more character arrays are registered, for a specified retrieval character array, the
method characterized by having: (1) a step of creating a character-array component table which
contains, without duplication, the contiguous partial character arrays of a predetermined
length (referred to as k) contained in said registration character arrays, and which expresses
information about these partial character arrays; (2) a step of registering said registration
character arrays together with said character-array component table in the character-array
database; (3) a step of extracting, by a predetermined approach and multiple times (referred to
as i), a subset of the character arrays of said predetermined length (k) contained in said
retrieval character array, thereby creating two or more (i) different subsets from said
retrieval character array; (4) a character-array component table search step of referring to
said character-array component table for each of said two or more (i) different subsets, and
extracting the registration character arrays that, for all of said subsets, contain more of the
character arrays in each subset than a fixed number defined by a predetermined rate of error
allowance, as candidates for character arrays within said rate of error allowance; and (5) a
character-array search step of referring to the registration character arrays obtained by the
character-array component table search step and extracting those registration character arrays
that are within said rate of error allowance.
[Claim 18] The character-array search method according to claim 17, characterized in that said
character-array component table is given, for every combination of said partial character array
and said registration character array, a bit information storing field which stores 1-bit
information showing whether or not said partial character array is contained in said
registration character array, and in that the step of searching said character-array component
table refers to said bit information storing field for each registration character array with
respect to said subset.

[Claim 19] The character-array search method according to claim 17 or 18, characterized in
that said character-array database includes, for each of said registration character arrays, a
plurality of partial character arrays which overlap one another by at least a predetermined
length and which together cover the overall length of that registration character array.

[Claim 20] The character-array search method according to claim 17, characterized in that the
step of creating two or more (referred to as i) different subsets from the character arrays of
said predetermined length (referred to as k) contained in said retrieval character array has:
(a) a step of creating one subset by performing, multiple times, the operation of shifting by a
number of characters (referred to as ks) at a time from the character at one end of said
retrieval character array and obtaining, in sequence, a character array of said predetermined
length (k); (b) a step of creating a new subset by shifting, by a number of characters
(referred to as kn), the character in said retrieval character array at which extraction of the
character arrays of said predetermined length (k) was started for said subset, and extracting
character arrays of said predetermined length (k); and (c) a step of further repeating the
operation of step (b), thereby creating two or more (i) different subsets each consisting of
character arrays of said predetermined length (k).
[Claim 21] The character-array search method according to claim 20, characterized in that said
predetermined length (k) and said number of characters (ks) are equal.

[Claim 22] The character-array search method according to claim 20 or 21, characterized in
that said number of characters (kn) is equal to 1.
[Claim 23] The character-array search method according to claim 17, characterized in that, in
the step of creating two or more (referred to as i) subsets from said retrieval character
array, character arrays of each of two or more (i) different predetermined lengths (referred to
as k1, ..., ki) are extracted multiple times while shifting by a number of characters (referred
to as ks1, ..., ksi) at a time from the character at one end of said retrieval character array,
thereby creating said number (i) of subsets, each consisting of character arrays of a different
predetermined length (ki).
[Claim 24] In the step of creating, for each of two or more (referred to as i) different
predetermined lengths (referred to as k1, ..., ki), a subset consisting of character arrays of
that length, after character arrays of the predetermined length (ki) have been extracted
multiple times while shifting by a number of characters (referred to as ksi) at a time from the
character at one end of said retrieval character array and one subset has been created,
shifting,



DETAILED DESCRIPTION 



[Detailed Description of the Invention] 
[0001] 

[Industrial Application] This invention relates to a search method for character arrays, and in
particular for databases of DNA, RNA, or amino acid sequences.

[0002] 

[Description of the Prior Art] As a 1st conventional technique, when a retrieval permitting
errors is performed on character arrays, in particular on a database of DNA, RNA, or amino acid
sequences, a comparison by the Smith-Waterman approach, which is based on dynamic programming
(DP), has been performed against all the arrays in the database (Protein, Nucleic Acid and
Enzyme, 1983, Vol. 28, No. 10, pp. 1165-1186). When comparing two character arrays, the
Smith-Waterman approach gives a positive score to a match between characters and a negative
score to a mismatch, a deletion, or an insertion, and then finds the alignment of the two
character arrays for which the total score is maximized. The number of base sequences
registered in GenBank, the public DNA database, is increasing every year, and the total number
of bases currently amounts to 1x10^8. It is known that searching the whole GenBank database
with the Smith-Waterman approach takes several hours or more, even when several mainframes are
used.
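
For illustration, the scoring described above can be sketched in Python as follows. The score
values (+1 for a match, -1 for a mismatch, -1 per inserted or deleted character) and the
function name are assumptions of this sketch; the source gives no specific values. The nested
loops make the cost of one comparison proportional to the product of the two array lengths.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # 0 lets an alignment restart anywhere, which makes it local
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


identical = smith_waterman_score("ACGT", "ACGT")
unrelated = smith_waterman_score("AAAA", "TTTT")
embedded = smith_waterman_score("GGACGTGG", "TTACGTTT")
```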
[0003] As a 2nd conventional technique, in order to shorten retrieval time, the approach called
FASTA has come into general use (Proceedings of the National Academy of Sciences USA (Proc.
Natl. Acad. Sci. USA), 1988, Vol. 85, pp. 2444-2448). This approach consists of two steps of
retrieval. In the 1st-step retrieval, a simplified comparison is performed against all the
arrays in the database. Matching substrings of a fixed character length (1 to 6) are extracted
between the two arrays; if matching substrings overlap, they are extended, and a score value is
given according to the length of the largest matching part obtained. The 2nd-step retrieval is
performed only on arrays whose score value is at or above a set threshold. In the 2nd-step
retrieval, the score value is calculated strictly using the Smith-Waterman approach mentioned
above. Based on this score value, the homology or similarity between base sequences is judged.
In the Smith-Waterman approach used here, retrieval time is shortened by performing the
comparison only in a limited range centered on the largest matching part obtained in the
1st-step retrieval.
[0004] On the other hand, as a 3rd conventional technique, full-text search has in recent years
come to replace retrieval by registered keywords in the retrieval of general document
databases. Because the retrieval time of a full-text search becomes huge, attempts have been
made to speed it up, and a multistage presearch-type retrieval method is one effective approach
(JP,04-274557,A). In this approach, a character component table, which describes what kinds of
characters are contained in each document in the database, and a condensed text, which is each
document with the particles removed, are created beforehand. At retrieval time, the documents
are first narrowed down using the character component table, next narrowed down by retrieval
against the condensed text, and finally a strict retrieval is performed, using an automaton, on
the documents that remain.
[0005] 

[Problem(s) to be Solved by the Invention] A feature of retrieval by the Smith-Waterman
approach, the 1st conventional technique above, is that the range of error allowance can be
changed by changing the score value set as the threshold. If a smaller threshold score is
chosen, retrieval based on similarity between, for example, DNA of completely different species
is possible. Conversely, if a larger threshold score is chosen, retrieval that permits only
errors on the order of the accuracy with which a DNA sequence is determined, that is, identity
retrieval, becomes possible. This approach allows similarity retrieval that takes character
matches, mismatches, deletions, and insertions into account. Its drawback is that a single
comparison requires a number of character comparisons proportional to the square of the array
length, so that retrieval time becomes huge for a large-scale database.

[0006] The FASTA approach, the 2nd conventional technique above, can likewise be used for both
similarity retrieval and homology retrieval by adjusting the score threshold in the 1st-step
retrieval. In FASTA, unrelated arrays that have no partial match are eliminated in the 1st-step
retrieval, and the speedup is achieved by narrowing down the number of arrays that are searched
strictly. Retrieval of the whole of GenBank by FASTA takes about several minutes when several
mainframes are used. Thus FASTA achieves quite practical high-speed retrieval, but it is known,
as a shortcoming, that a certain number of retrieval omissions occur in the 1st-step retrieval.
FASTA may drop an array whose degree of partial match is poor on average even though its degree
of match is high as a whole. When the score threshold was lowered in order to eliminate
omissions, there was the problem that the narrowing-down became less effective and the overall
retrieval became slow.

[0007] One feature of the approach of the 3rd conventional technique above is that there are no
retrieval omissions in the narrowing-down of each phase. Moreover, high-speed retrieval is made
possible by discarding many unrelated documents in the narrowing-down of each phase, thereby
reducing the number of times the time-consuming strict retrieval is performed. However, this
approach had the problem that it could search general documents only for an exact match with
the retrieval sentence or with a sentence derived from the retrieval sentence under fixed
rules. Therefore, retrieval that permits errors could not be performed on a database consisting
of character strings that contain probabilistic errors arising from experimental error, such as
DNA sequences.

[0008] The object of this invention is to solve the problems explained above and to offer an
error-permitting character-array search method which, even for a character-array database of
practical size, in particular a public database of DNA sequences or amino acid sequences,
achieves a retrieval time short enough to be acceptable in practice, obtains a retrieval result
without retrieval omissions, and makes the entire character string of each array subject to
retrieval.
[0009] 

[Means for Solving the Problem] This invention is characterized by a 1st character-array search
method containing the following processing steps (1) to (6).
[0010] (1) A step of storing character-array data.

[0011] (2) A step of creating a character-array component table which contains, without
duplication, the contiguous partial character arrays of a predetermined length (referred to as
k) contained in the registration character arrays, and which expresses information about these
partial character arrays.

[0012] (3) A step of registering the registration character arrays together with the
character-array component table in a character-array database.

[0013] (4) A step of extracting, by a predetermined approach, a subset of the character arrays
of the predetermined length (k) contained in the retrieval character array specified by the
searcher.
[0014] (5) A character-array component table search step of referring to the character-array
component table and extracting the registration character arrays that contain more of the
character arrays in the subset than a fixed number defined by the predetermined rate of error
allowance, as candidates for character arrays within the rate of error allowance.
[0015] (6) A character-array search step of referring to the registration character arrays
obtained by the character-array component table search step and extracting the registration
character arrays that are within the rate of error allowance.
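
Steps (1) to (6) of the 1st method can be sketched end to end in Python as below. Several
points are assumptions of this sketch rather than statements of the source: the error rate is
converted to an error count as floor(rate x query length); the windows are non-overlapping
(ks = k), so that each error can spoil at most one window and an array within the allowance
must still contain at least (number of windows - error count) of them; and the final
character-array search is stood in for by an approximate substring match using edit distance.

```python
def kmers(seq, k):
    """Distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def min_edit_distance_in(text, query):
    """Smallest edit distance between query and any substring of text."""
    prev = [0] * (len(text) + 1)  # an empty query matches anywhere at cost 0
    for i in range(1, len(query) + 1):
        cur = [i] + [0] * len(text)
        for j in range(1, len(text) + 1):
            cost = 0 if query[i - 1] == text[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return min(prev)


def search(registered, query, k, error_rate):
    """Two-stage search: component-table filter, then strict verification."""
    max_errors = int(error_rate * len(query))
    windows = [query[i:i + k] for i in range(0, len(query) - k + 1, k)]
    needed = len(windows) - max_errors  # assumed threshold, see lead-in
    table = [kmers(seq, k) for seq in registered]  # built once at registration
    results = []
    for j, seq in enumerate(registered):
        hits = sum(1 for w in windows if w in table[j])
        if hits < needed:
            continue  # discarded without reading the array itself
        if min_edit_distance_in(seq, query) <= max_errors:
            results.append(j)
    return results


registered = ["TTACGTACGTAATT", "GGGGGGGGGGGGGG", "TTACGAACGTAATT"]
with_one_error = search(registered, "ACGTACGTAA", 3, 0.1)
exact_only = search(registered, "ACGTACGTAA", 3, 0.0)
```

The second registered array never reaches the costly verification stage, which is the saving
the hierarchical search aims at.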
[0016] Moreover, this invention is characterized by a 2nd character-array search method
containing the following processing steps (1) to (6).
[0017] (1) A step of storing character-array data.

[0018] (2) A step of creating a character-array component table which contains, without
duplication, the contiguous partial character arrays of a predetermined length (referred to as
k) contained in the registration character arrays, and which expresses information about these
partial character arrays.

[0019] (3) A step of registering the registration character arrays together with the
character-array component table in a character-array database.

[0020] (4) A step of extracting, by a predetermined approach, partial character arrays of the
predetermined length (k) from the retrieval character array specified by the searcher, and
creating two or more subsets whose elements differ.

[0021] (5) A step of searching the previously created character-array component table with all
of the two or more subsets, in order to extract from the character-array database the
registration character arrays that contain more of the partial character arrays in each subset
than a fixed number defined by the predetermined rate of error allowance.

[0022] (6) A character-array search step of referring to the registration character arrays
obtained by the character-array component table search step and extracting the registration
character arrays that are within the rate of error allowance.
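
The multiple-subset filter of the 2nd method might be sketched in Python as follows. The names
`shifted_subsets` and `passes_all`, and the per-subset threshold (subset size minus the
permitted error count), are assumptions for illustration; the start position of each new subset
is shifted by kn characters as described in the claims.

```python
def shifted_subsets(query, k, ks, kn, count):
    """Create `count` subsets; subset s starts s*kn characters from the end."""
    subsets = []
    for s in range(count):
        start = s * kn
        subsets.append(
            [query[i:i + k] for i in range(start, len(query) - k + 1, ks)]
        )
    return subsets


def passes_all(seq_kmers, subsets, max_errors):
    """A registered array is kept only if every subset meets its threshold."""
    return all(
        sum(1 for w in subset if w in seq_kmers) >= len(subset) - max_errors
        for subset in subsets
    )


subsets = shifted_subsets("ACGTACGTAC", 4, 4, 1, 3)
partial = passes_all({"ACGT", "CGTA"}, subsets, 0)
complete = passes_all({"ACGT", "CGTA", "GTAC"}, subsets, 0)
```

An array that happens to satisfy one subset but not the others is rejected, so fewer arrays
reach the search that reads the arrays themselves.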

[0023] Furthermore, this invention is characterized by a 3rd character-array search method
containing the following processing steps (1) to (6).

[0024] (1) A step of storing character-array data.

[0025] (2) A step of creating two or more character-array component tables, differing in the
length k, each showing the appearance information of every contiguous partial character array
of a predetermined length (referred to as k) extracted from the registration character arrays.

[0026] (3) A step of registering the registration character arrays together with the
character-array component tables in a character-array database.

[0027] (4) A step of extracting, by a predetermined approach, partial character arrays of two
or more predetermined lengths (k) from the retrieval character array specified by the searcher,
and creating two or more subsets whose elements differ.
[0028] (5) A step of searching the previously created character-array component tables with
all of the two or more subsets, in order to extract from the character-array database the
registration character arrays that contain more of the partial character arrays in each subset
than a fixed number defined by the predetermined rate of error allowance.

[0029] (6) A character-array search step of referring to the registration character arrays
obtained by the character-array component table search step and extracting the registration
character arrays that are within the rate of error allowance.
[0030] This invention is also characterized in that, in the 1st to 3rd character-array search
methods above, the same processing is performed in steps (1) and (2) on the annular character
array in which both ends of a character array are connected, and steps (3) to (6) are performed
in the same way.
[0031] 

[Function] A hierarchical presearch means is provided which first narrows down by retrieval of
the character-array component table, in which character arrays of a predetermined length are
registered, and then performs character-array retrieval. Moreover, a subset is extracted from
the character arrays in the retrieval character array and the character-array component table
is searched using the subset; in this narrowing-down, a number determined from the rate of
error allowance given beforehand by the searcher is used as the criterion for the number of hit
array components. As a result, character arrays that differ from the given retrieval character
array by more than the rate of error allowance can be discarded before the character arrays
themselves are referred to, and the amount of character-array data that must be searched can be
reduced. That is, by reducing the processing time of the character-array retrieval, which
accounts for a large share of the total retrieval processing time, it is possible to shorten
the processing time of the whole retrieval while retrieving, without omission, the character
arrays that are within the rate of error allowance of the given retrieval character array.
Moreover, by searching, with the annular character array in which both ends of the retrieval
character array are connected, the character component table created from the annular character
arrays in which both ends of each character array are connected, it is possible to raise the
narrowing-down rate further and shorten retrieval time.

[0032] Moreover, when the retrieval that refers to the character-array component table is
performed, two or more subsets whose elements differ are created by a predetermined approach, a
search is made using each subset, and the registration character arrays that satisfy the
retrieval conditions for all of those subsets are selected. In this way the character arrays
can be narrowed down further in the character-array component table search, and the number of
character arrays subjected to the retrieval that refers to the character arrays themselves can
be decreased. Therefore, the approach of creating two or more subsets and using them for
retrieval can shorten the retrieval processing time further. Moreover, by creating, from the
annular character array in which both ends of the retrieval character array are connected,
subsets consisting of two or more partial character arrays whose elements differ, and searching
with them the character-array component table created from the annular character arrays in
which both ends of each character array are connected, the narrowing-down rate is improved
further and retrieval time can be shortened.
[0033] 

[Example] Hereafter, a character-array retrieval apparatus to which the character-array search
method of this invention is applied, and examples thereof, are explained.
(Example 1) The 1st example of this invention is explained below using drawing 1. This
apparatus consists of a display 100, a keyboard 101, a central processing unit (CPU) 102, a
file 106 for storing the character-array component table 104 and the character array 103, a
floppy disk drive 105, and main memory 200.

[0034] The character-array registration program 201, the character-array component table
generation and registration program 202, the error-allowance character-array component table
search program 203, the error-allowance character-array search program 204, and the
hierarchical retrieval control program 206 are stored in main memory 200, and a data area 205
is also secured there. These programs are executed by CPU 102.

[0035] To register character arrays, CPU 102 executes the character-array registration
program 201 in response to a command entered from the keyboard 101, reads character-array
data from the floppy disk 107 inserted in the floppy disk drive 105, and stores the data in
the file 106 as the character arrays 103. Next, CPU 102 executes the character-array
component table generation and registration program 202, creates a character-array component
table that collects, without duplication, the character components of predetermined length
used in the character arrays 103, and stores it in the file 106 as the character-array
component table 104.

[0036] For retrieval, the retrieval character array and the permissible rate of retrieval
errors entered from the keyboard 101 are sent to CPU 102. CPU 102 first executes the
hierarchical retrieval control program 206 and, under its control, executes the
character-array component table search program 203 and the character-array search program
204 in sequence. The character-array component table search extracts only those character
arrays that contain at least a predetermined number of the character-array components of the
retrieval character array, where the number is based on the permissible error rate. The
character-array search is then performed on the character arrays extracted by the component
table search, only those that satisfy the permissible error rate are extracted, and they are
output as the retrieval result. The above is the outline of the character-array retrieval
equipment that performs the character-array search method of this invention.
[0037] The registration method and the search methods that characterize this invention,
namely the error-permitting character-array component table search, the error-permitting
character-array search, and their hierarchical pre-search, are explained below taking
retrieval of DNA sequences as a typical case in which error-permitting retrieval is
important. Drawing 2 shows the processing for registering DNA sequences and for creating and
registering the character-array component table. First, registration 300 of the DNA
sequences themselves (DNA sequences 1, 2, ..., N) is performed. The base sequence of DNA can
be expressed as a string of four kinds of base characters, Adenine A, Cytosine C, Guanine G,
and Thymine T, as shown in drawing 2. Next, extraction 301 of the character-array components
from the registered DNA sequences is performed. As shown in drawing 2, components are
extracted from a DNA sequence by shifting one base at a time from one end of the sequence to
the other and taking out a base sequence component of predetermined, fixed length (6 bases
in this case) at each position. Next, DNA sequence character component table generation 302
is performed using the base sequence components extracted in this way. A DNA sequence
character component table holds one bit of information for every possible kind of base
sequence component (since the components here are 6 bases long, the number of component
kinds is 4 to the 6th power = 4096). That is, '1' is set in the entry of the table for each
base sequence component extracted from the DNA sequence, and '0' is set in every other entry.
In the example in drawing 2, the base sequence component AAAAAA does not exist in DNA
sequence 1, so '0' is set in the entry for AAAAAA. The base sequence components AAAAAC,
AAAACC, and TTTTTT do exist, so '1' is set in the entries for AAAAAC, AAAACC, and TTTTTT.
Finally, registration 304 of the DNA sequence character component table created in this way
to the database is performed.
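The table generation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the 2-bits-per-base numbering of components, and the use of a Python integer as the 4096-bit table are all choices made here, not details given in the source.

```python
BASES = "ACGT"  # assumed ordering of the four base characters
K = 6           # component length used in the example (6 bases)


def component_index(component: str) -> int:
    """Map a k-base component to an integer in [0, 4**k), 2 bits per base."""
    index = 0
    for base in component:
        index = index * 4 + BASES.index(base)  # raises ValueError for non-ACGT input
    return index


def build_component_table(sequence: str, k: int = K) -> int:
    """Return the 1-bit-per-component table as an int; bit i is 1 if component i occurs."""
    table = 0
    for start in range(len(sequence) - k + 1):  # shift one base at a time
        table |= 1 << component_index(sequence[start:start + k])
    return table


def has_component(table: int, component: str) -> bool:
    """Read back the bit for one component kind."""
    return (table >> component_index(component)) & 1 == 1


table = build_component_table("AAAAACC")  # contains AAAAAC and AAAACC only
print(has_component(table, "AAAAAC"), has_component(table, "AAAAAA"))
```

With k = 6 the table needs 4096 bits (512 bytes) per registered sequence, which is what makes scanning the table much cheaper than aligning against the sequence itself.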
[0038] At retrieval time, the search refers to the DNA sequence character component table
created above, as shown in drawing 3. First, input 400 of the retrieval base sequence and
the permissible error rate m for retrieval is performed. The permissible error rate m is set
according to the precision of the entered retrieval base sequence and of the base sequences
in the database. It is known that a determined base sequence differs from the actual base
sequence because of reading errors in the experimental data at the time of sequence
determination. The precision of a base sequence is determined by the degree of this
difference. It is therefore sufficient to obtain precision information on the base sequences
by experiment beforehand and to use it to determine the permissible error rate for
retrieval. Although the precision of a base sequence depends on the sequencing method and
other factors, a value of about 5 to 10% or less may be set as the permissible error rate
for judging identity. Next, extraction 401 of the sequence components from the retrieval
sequence is performed. For a retrieval sequence of length Nk, components of k bases (6 bases
in the drawing) are extracted starting from one end and shifting by k bases each time, with
neither overlap nor gap, for as long as a component of k bases can be obtained before the
other end. The extracted components are numbered in order of extraction (from i=1 to i=Ne).
Next, retrieval 402 against the already registered DNA sequence component table is performed
using the extracted components. This retrieval is performed as follows, as shown in drawing
3. Let fi be the value of the entry in the DNA sequence component table that corresponds to
the i-th component extracted from the retrieval sequence, and let S be the sum of fi over
all extracted components, from i=1 to i=Ne. The retrieval hit condition is that S is greater
than or equal to Ne - m·Nk. For a fixed number of retrieval errors, the number of components
with fi = 0 is largest when the errors are distributed one to each component. When the
errors are within the permissible error rate, that is, when there are at most m·Nk errors,
the number of components with fi = 0 is therefore at most m·Nk. Consequently, if the value
obtained by subtracting m·Nk from the number of components Ne is set as the threshold for a
retrieval hit, every sequence whose errors are within the permissible error rate satisfies
the hit condition, and retrieval can be performed without omission.
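The hit condition S >= Ne - m·Nk can be illustrated with a short Python sketch. The function names are hypothetical, and a Python set of component strings stands in for the 1-bit component table purely for readability; both are assumptions of this sketch rather than details of the source.

```python
def registered_components(sequence: str, k: int = 6) -> set[str]:
    """Stand-in for the component table: all k-base windows, shifting one base at a time."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def query_components(query: str, k: int = 6) -> list[str]:
    """Components taken every k bases from one end, with no overlap and no gap."""
    return [query[i:i + k] for i in range(0, len(query) - k + 1, k)]


def is_hit(query: str, table: set[str], m: float, k: int = 6) -> bool:
    """Apply the hit condition S >= Ne - m * Nk."""
    components = query_components(query, k)
    ne = len(components)                                # Ne: number of extracted components
    s = sum(1 for c in components if c in table)        # S: sum of fi over i = 1..Ne
    return s >= ne - m * len(query)                     # Nk is the query length


registered = "ACGTTGCAAGGCTTACCA"
query = "ACGTTGCACGGCTTACCA"  # one substitution relative to the registered sequence
table = registered_components(registered)
print(is_hit(query, table, m=0.1), is_hit(query, table, m=0.0))
```

Here Ne = 3 and the single substitution zeroes exactly one component, so S = 2. With m = 0.1 the threshold is 3 - 1.8 = 1.2 and the sequence is kept; with m = 0 the threshold is 3 and it is rejected.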

[0039] Next, retrieval 403 by the sequences themselves is performed on the base sequences
found by the DNA sequence component table search, and the retrieval result 404 is output.
For this stage it is appropriate to compute an inter-sequence score by the Smith-Waterman
method, which is based on dynamic programming. The Smith-Waterman method assigns suitable
score values to the deletion, insertion, substitution, and match of sequence characters,
aligns the two sequences, and finds the alignment that maximizes the total score. Using the
score of this alignment as an index of the similarity between two sequences makes it
possible to retrieve accurately the base sequences within the permissible error rate. Thus,
in this example, a search by the base sequence component table, based on the degree of match
between the components of the retrieval base sequence and those of the base sequences in the
database, is performed first, and the many base sequences that remain unrelated even when a
fixed amount of error is permitted are screened out. Only the base sequences narrowed down
in this way are then searched by the Smith-Waterman method, which is time-consuming but
exact. This realizes fast, error-permitting retrieval without omission. The narrowing-down
rate of the DNA sequence component table search, which is the major factor determining
retrieval speed, is estimated below in order to estimate how much speedup is possible. When
a database contains many identical base sequences, the narrowing-down rate depends on the
number of identical base sequences in the database. Here, therefore, we consider the case
where the database consists of mutually unrelated base sequences and contains no base
sequence that should hit the retrieval base sequence. This makes it possible to evaluate the
probability that an unrelated base sequence hits by chance, that is, that retrieval noise
arises in the search by the base sequence component table. The following is considered as a
model of such a system.
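For reference, the score computed by the Smith-Waterman method in the second stage can be sketched as below. The score values (+1 for a match, -1 for a substitution, -1 per inserted or deleted character) and the function name are assumptions of this sketch; the source only says that suitable score values are assigned.

```python
def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between a and b with a linear gap penalty."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("ACGTACGT", "ACGAACGT"))
```

The double loop compares every character of one sequence with every character of the other, which is why this exact stage is applied only to the sequences that survive the component table search.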

[0040] (1) The base sequences in the database are random sequences of fixed length Nd.

[0041] (2) The retrieval sequence is a random sequence of fixed length Nk.
[0042] The narrowing-down rate RS in this case is calculated as follows. The number of
component kinds for which '1' is set in the base sequence component table for each sequence
is largest when no component is duplicated within the sequence, and this maximum is given by
the number Np of components extracted from a base sequence.
TECHNICAL FIELD 

[Industrial Application] This invention relates to a method for searching character arrays,
and in particular to searching databases of DNA, RNA, or amino acid sequences.
PRIOR ART

[Description of the Prior Art] As a 1st conventional technique, when a database of character
arrays, in particular DNA, RNA, or amino acid sequences, is searched while permitting
errors, a comparison by the Smith-Waterman method, which is based on dynamic programming
(DP), has been performed against every sequence in the database (Protein, Nucleic Acid and
Enzyme, 1983, Vol. 28, No. 10, pp. 1165-1186). When two character arrays are compared, the
Smith-Waterman method gives a positive score to a character match and negative scores to a
mismatch, a deletion, and an insertion, and finds the alignment of the two character arrays
that maximizes the total score. The number of base sequences registered in GenBank, the
public DNA database, is increasing every year, and the total number of bases currently
amounts to about 1x10^8. Searching the whole GenBank database with the Smith-Waterman method
is known to take several hours or more even on a mainframe computer.
[0003] As a 2nd conventional technique, a method called FASTA has therefore come into
general use in order to shorten retrieval time (Proceedings of the National Academy of
Sciences USA (Proc. Natl. Acad. Sci. USA), 1988, Vol. 85, pp. 2444-2448). This method
consists of two retrieval steps. In the 1st-step retrieval, a simplified comparison is made
against all the sequences in the database. Matching substrings of fixed character length
(1 to 6) are extracted between the two sequences, matches that overlap one another are
merged and extended, and a score value is given according to the length of the longest
matching region obtained. The 2nd-step retrieval is performed only on sequences whose score
value exceeds a set threshold. In the 2nd-step retrieval, a score value is calculated
strictly using the Smith-Waterman method described above. Homology or similarity between
base sequences is judged from this score value. Retrieval time is shortened because the
Smith-Waterman comparison used here is limited to a range centered on the longest matching
region obtained in the 1st-step retrieval.
[0004] On the other hand, as a 3rd conventional technique, full text search has in recent
years come to replace search by registered keywords in general document databases. Because
the retrieval time of a full text search becomes enormous, attempts have been made to speed
it up, and a multistage retrieval method of the pre-search type is one effective approach
(JP,04-274557,A). In this method, a character component table describing which characters
are contained in each document in the database, and a condensed text from which particles
have been removed from each document, are created beforehand. At retrieval time, the
documents are first narrowed down by the character component table, then narrowed down
further by searching the condensed text, and finally a strict search using an automaton is
performed on the documents that remain.
EFFECT OF THE INVENTION

[Effect of the Invention] According to this invention, a hierarchical pre-search means is
provided that first narrows down candidates by searching the character-array component
table, in which character arrays of predetermined length are registered, and then performs
the character-array search. In the narrowing down, a subset is selected from the character
arrays of predetermined length in the retrieval character array and the character-array
component table is searched with that subset. The criterion for the number of hit components
is a number determined from the permissible error rate that the searcher gives beforehand.
Character arrays that differ from the given retrieval character array by more than the
permissible error rate are thereby excluded before the character arrays themselves are
referred to, and the amount of character-array data that must be searched is reduced. As a
result, character arrays within the permissible error rate of the given retrieval character
array can be retrieved without omission, and a practical response speed can be obtained even
for a large-scale character-array database.
TECHNICAL PROBLEM

[Problem(s) to be Solved by the Invention] A feature of retrieval by the Smith-Waterman
method, the 1st conventional technique above, is that the range of permitted error can be
changed by changing the score value set as the threshold. With a smaller threshold score,
retrieval based on similarity is possible, for example between DNA of entirely different
species. Conversely, with a larger threshold score, retrieval that permits only errors on
the order of the precision of a DNA sequence, that is, identity retrieval, is possible. This
method allows similarity retrieval that takes into account character matches, mismatches,
deletions, and insertions. Its drawback is that a single comparison requires a number of
character comparisons proportional to the square of the sequence length, so retrieval time
becomes enormous for a large-scale database.

[0006] FASTA, the 2nd conventional technique above, can likewise be used for both similarity
retrieval and homology retrieval by adjusting the score threshold in the 1st-step retrieval.
In FASTA, unrelated sequences that contain no partial match at all are eliminated by the
1st-step retrieval, and speed is gained by narrowing down the number of sequences that are
searched strictly. Retrieval of the whole of GenBank by FASTA takes about several minutes on
a mainframe computer. Although FASTA thus achieves quite practical high-speed retrieval, it
is known to have the shortcoming that a certain amount of retrieval omission occurs in the
1st-step retrieval. A sequence whose degree of partial match is poor on average but whose
degree of match is high as a whole may be dropped by FASTA. When the score threshold was
lowered to eliminate such omissions, the narrowing down became less effective and the
overall retrieval speed became slow.

[0007] One feature of the 3rd conventional technique above is that no retrieval omission
occurs in the narrowing down at each stage. High-speed retrieval is also made possible by
screening out many unrelated documents at each stage and so reducing the number of
time-consuming strict searches. However, this method could search general documents only for
exact matches with the retrieval sentence or with sentences derived from the retrieval
sentence under fixed rules. It therefore could not perform error-permitting retrieval on a
database of character strings, such as DNA sequences, that contain probabilistic errors
arising from experimental error.

[0008] The purpose of this invention is to solve the problems explained above and to offer
an error-permitting character-array search method that, even for a character-array database
of practical scale, in particular a public database of DNA sequences or amino acid
sequences, completes retrieval in a time short enough to be acceptable in practice, yields a
retrieval result without retrieval omission, and makes every character string in an array
subject to retrieval.
MEANS

[Means for Solving the Problem] This invention is characterized by a 1st character-array
search method containing the following processing steps (1) to (6).
[0010] (1) The step of storing character-array data.

[0011] (2) The step of creating a character-array component table that contains, without
duplication, the partial character arrays of consecutive characters of predetermined length
(denoted k) contained in said registered character arrays, and that expresses information
about these partial character arrays.

[0012] (3) The step of registering the registered character arrays together with the
character-array component table in a character-array database.

[0013] (4) The step of extracting, by a predetermined method, a subset of the character
arrays of the predetermined length (k) contained in the retrieval character array specified
by the searcher.

[0014] (5) The character-array component table search step of referring to the
character-array component table and extracting the registered character arrays that contain
more of the character arrays in the subset than a fixed number determined by the
predetermined permissible error rate, in order to extract character arrays within the
permissible error rate.
[0015] (6) The character-array search step of referring to the registered character arrays
obtained by the character-array component table search step and extracting the registered
character arrays within the permissible error rate.
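Steps (1) to (6) of the 1st method can be tied together in a small Python sketch. Several things here are assumptions made only for illustration: the function names, the use of Python sets as component tables, taking the subset as the non-overlapping components of the retrieval array, and, in step (6), a plain edit-distance check against the best-matching substring (a standard approximate-matching dynamic program, swapped in for brevity rather than taken from the source).

```python
def components(seq: str, k: int, step: int) -> list[str]:
    """All windows of length k taken every `step` characters."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def best_substring_distance(query: str, text: str) -> int:
    """Smallest edit distance between query and any substring of text."""
    previous = [0] * (len(text) + 1)  # empty query matches anywhere at cost 0
    for i in range(1, len(query) + 1):
        current = [i] + [0] * len(text)
        for j in range(1, len(text) + 1):
            cost = 0 if query[i - 1] == text[j - 1] else 1
            current[j] = min(previous[j - 1] + cost, previous[j] + 1, current[j - 1] + 1)
        previous = current
    return min(previous)


def search(database: list[str], query: str, m: float, k: int = 6) -> list[str]:
    # Steps (1)-(3): store each array together with its duplicate-free component table.
    index = [(seq, set(components(seq, k, 1))) for seq in database]
    subset = components(query, k, k)          # step (4): subset of the query's components
    allowed = m * len(query)                  # errors permitted by the rate m
    hits = []
    for seq, table in index:
        found = sum(1 for c in subset if c in table)
        if found < len(subset) - allowed:     # step (5): component table search
            continue
        if best_substring_distance(query, seq) <= allowed:  # step (6): array search
            hits.append(seq)
    return hits


db = ["ACGTTGCAAGGCTTACCA", "GGGGGGGGGGGGGGGGGG", "TTTTACGTTGCAAGGCTTACCATTTT"]
print(search(db, "ACGTTGCACGGCTTACCA", m=0.1))
```

The costly dynamic program in step (6) runs only for arrays that pass the cheap set lookups in step (5), which is the point of the hierarchical arrangement.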
[0016] Moreover, this invention is characterized by a 2nd character-array search method
containing the following processing steps (1) to (6).
[0017] (1) The step of storing character-array data.

[0018] (2) The step of creating a character-array component table that contains, without
duplication, the partial character arrays of consecutive characters of predetermined length
(denoted k) contained in the registered character arrays, and that expresses information
about these partial character arrays.

[0019] (3) The step of registering the registered character arrays together with the
character-array component table in a character-array database.

[0020] (4) The step of extracting partial character arrays of predetermined length (k) from
the retrieval character array specified by the searcher and creating, by a predetermined
method, two or more subsets with different elements.

[0021] (5) The character-array component table search step of searching the previously
created character-array component table with all of the subsets, in order to extract from
the character-array database the registered character arrays that contain more of the
partial character arrays in each subset than a fixed number determined by the predetermined
permissible error rate.

[0022] (6) The character-array search step of referring to the registered character arrays
obtained by the character-array component table search step and extracting the registered
character arrays within the permissible error rate.
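The use of two or more subsets in steps (4) and (5) can be illustrated as follows. The source leaves the "predetermined method" for building the subsets open at this point; this sketch assumes one subset per starting offset 0 to k-1, each made of non-overlapping components, and all names are hypothetical.

```python
def registered_components(sequence: str, k: int = 6) -> set[str]:
    """Stand-in for the component table: every k-character window of the array."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def offset_subsets(query: str, k: int = 6) -> list[list[str]]:
    """Assumed subset rule: one subset per starting offset, stepping k characters."""
    return [[query[i:i + k] for i in range(offset, len(query) - k + 1, k)]
            for offset in range(k)]


def passes_all_subsets(query: str, table: set[str], m: float, k: int = 6) -> bool:
    """Keep a registered array only if every subset meets its own hit threshold."""
    allowed = m * len(query)
    return all(sum(1 for c in subset if c in table) >= len(subset) - allowed
               for subset in offset_subsets(query, k))


query = "ACGTTGCACGGCTTACCA"
related = registered_components("ACGTTGCAAGGCTTACCA")   # differs by one substitution
decoy = registered_components("ACGTTGTTACCA")           # shares only two components
print(passes_all_subsets(query, related, m=0.1), passes_all_subsets(query, decoy, m=0.1))
```

In this example the decoy would satisfy the threshold for the offset-0 subset alone (two of three components found), but it fails the offset-1 subset, so requiring all subsets to hit narrows the candidates further.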

[0023] Furthermore, this invention is characterized by a 3rd character-array search method
containing the following processing steps (1) to (6).

[0024] (1) The step of storing character-array data.
[0025] (2) The step of creating two or more character-array component tables with different
lengths k, each expressing appearance information for every partial character array of
consecutive characters of predetermined length (denoted k) extracted from the registered
character arrays.

[0026] (3) The step of registering the registered character arrays together with the
character-array component tables in a character-array database.

[0027] (4) The step of extracting partial character arrays of two or more predetermined
lengths (k) from the retrieval character array specified by the searcher and creating, by a
predetermined method, two or more subsets with different elements.
[0028] (5) The character-array component table search step of searching the previously
created character-array component tables with all of the subsets, in order to extract from
the character-array database the registered character arrays that contain more of the
partial character arrays in each subset than a fixed number determined by the predetermined
permissible error rate.

[0029] (6) The character-array search step of referring to the registered character arrays
obtained by the character-array component table search step and extracting the registered
character arrays within the permissible error rate.
[0030] A further feature is that, in each of the 1st to 3rd character-array search methods
above, the same processing is applied in steps (1) and (2) to circular character arrays
formed by joining both ends of each character array, and steps (3) to (6) are then performed
in the same way.
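Extracting components from a circular character array, as mentioned in [0030], can be sketched as below. The function name is hypothetical, and the sketch assumes the array is at least k characters long; the source does not spell out these details.

```python
def circular_components(sequence: str, k: int = 6) -> list[str]:
    """k-character windows of the array with its two ends joined (assumes len >= k)."""
    extended = sequence + sequence[:k - 1]  # let windows wrap around the joined ends
    return [extended[i:i + k] for i in range(len(sequence))]


print(circular_components("ACGTAC", 4))
```

A circular array of length N yields N components instead of the N - k + 1 obtained from the linear array, because windows that start near one end continue from the other.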
OPERATION

[Function] A hierarchical pre-search means is provided that first narrows down candidates by
searching the character-array component table, in which character arrays of predetermined
length are registered, and then performs the character-array search. In the narrowing down,
a subset is extracted from the character arrays in the retrieval character array and the
character-array component table is searched with that subset. The criterion for the number
of hit components is a number determined from the permissible error rate that the searcher
gives beforehand. Character arrays that differ from the given retrieval character array by
more than the permissible error rate can thereby be excluded before the character arrays
themselves are referred to, and the amount of character-array data that must be searched is
reduced. That is, by reducing the time taken by the character-array search, which accounts
for a large share of the retrieval processing time, the processing time of the whole
retrieval can be shortened while character arrays within the permissible error rate of the
given retrieval character array are retrieved without omission. Moreover, it is possible to
raise the narrowing-down rate further and to shorten retrieval time by searching a character
component table built from circular character arrays, each formed by joining both ends of a
registered character array, with a circular character array formed by joining both ends of
the retrieval character array.

[0032] Moreover, when the character-array component table is searched, two or more subsets
with different elements are created by a predetermined method, a search is made with each
subset, and only the registered character arrays that satisfy the retrieval condition for all
of the subsets are selected. The character-array component table search thereby narrows down
the character arrays further, and the number of character arrays that must be searched by
referring to the character arrays themselves decreases. Creating two or more subsets and
using them for retrieval therefore shortens the retrieval processing time further. Moreover,
the narrowing-down rate can be improved and the retrieval time shortened by creating such
subsets of partial character arrays from the circular character array formed by joining both
ends of the retrieval character array, and searching a character-array component table built
from circular character arrays formed by joining both ends of each registered character array.
EXAMPLE

[Example] Hereafter, character-array retrieval equipment to which the character-array search
method of this invention is applied, and examples of the method, are explained.
(Example 1) The 1st example of this invention is explained below using drawing 1. This
equipment consists of a display 100, a keyboard 101, a central processing unit (CPU) 102, a
file 106 for storing the character-array component table 104 and the character arrays 103, a
floppy disk drive 105, and main memory 200.

[0034] Main memory 200 stores the character-array registration program 201, the
character-array component table generation and registration program 202, the error-permitting
character-array component table search program 203, the error-permitting character-array
search program 204, and the hierarchical retrieval control program 206, and also reserves the
data area 205. These programs are executed by CPU 102.

[0035] To register character arrays, CPU 102 executes the character-array registration
program 201 in response to a command entered from the keyboard 101, reads character-array
data from the floppy disk 107 inserted in the floppy disk drive 105, and stores the data in
the file 106 as the character arrays 103. Next, CPU 102 executes the character-array
component table generation and registration program 202, creates a character-array component
table that collects, without duplication, the character components of predetermined length
used in the character arrays 103, and stores it in the file 106 as the character-array
component table 104.

[0036] For retrieval, the retrieval character array and the permissible rate of retrieval
errors entered from the keyboard 101 are sent to CPU 102. CPU 102 first executes the
hierarchical retrieval control program 206 and, under its control, executes the
character-array component table search program 203 and the character-array search program
204 in sequence. The character-array component table search extracts only those character
arrays that contain at least a predetermined number of the character-array components of the
retrieval character array, where the number is based on the permissible error rate. The
character-array search is then performed on the character arrays extracted by the component
table search, only those that satisfy the permissible error rate are extracted, and they are
output as the retrieval result. The above is the outline of the character-array retrieval
equipment that performs the character-array search method of this invention.
[0037] The registration and search methods of the error-permitting character-array component
table search, the character-array search, and their hierarchical pre-search method, which are the
features of this invention, are explained below, taking retrieval of DNA sequences as a typical
example in which error-permitting retrieval is important. Drawing 2 shows the processing for
registering DNA sequences and for creating and registering the character-array component table.
First, registration 300 of the DNA sequences themselves (DNA sequences 1, 2, ..., N) is performed.
As shown in drawing 2, the base sequence of DNA can be expressed as a string of four kinds of base
characters: adenine A, cytosine C, guanine G, and thymine T. Next, extraction 301 of
character-array components from the registered DNA sequence is performed. As shown in drawing 2,
the components are extracted by shifting one base at a time and taking a base sequence component
of a predetermined fixed length (6 bases in this case) from one end of the DNA sequence until the
other end is reached. Next, DNA sequence character component table creation 302 is performed using
the base sequence components extracted in this way. A DNA sequence character component table holds
1 bit of information for each of all possible base sequence component kinds (in this case, since
the components are 6 bases long, the number of component kinds is 4 to the 6th power = 4096).
That is, '1' is set in the entry corresponding to each base sequence component extracted from the
DNA sequence, and '0' is set in every other entry. In the example in drawing 2, since the base
sequence component AAAAAA does not exist in DNA sequence 1, '0' is set in the AAAAAA entry of the
table. Since the base sequence components AAAAAC, AAAACC, and TTTTTT do exist, '1' is set in the
AAAAAC, AAAACC, and TTTTTT entries of the table. Finally, registration 304 of the DNA sequence
character component table created in this way to the database is performed.
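The table creation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names and the choice of reading a component as a base-4 number (A=0, C=1, G=2, T=3) to obtain its entry are assumptions made for the example.

```python
BASES = "ACGT"  # assumed entry ordering: A=0, C=1, G=2, T=3


def component_index(component: str) -> int:
    """Map a k-base component to a table entry by reading it as a base-4 number."""
    index = 0
    for base in component:
        index = index * 4 + BASES.index(base)
    return index


def build_component_table(sequence: str, k: int = 6) -> list:
    """Return a 4**k entry table with 1 for every k-base component in the sequence."""
    table = [0] * (4 ** k)
    # Shift one base at a time from one end of the sequence to the other.
    for start in range(len(sequence) - k + 1):
        table[component_index(sequence[start:start + k])] = 1
    return table


table = build_component_table("AAAAACC", k=6)
print(len(table), sum(table))
```

For the 7-base sequence in the usage line, only the entries for AAAAAC and AAAACC are set, and the AAAAAA entry stays 0, as in the drawing 2 example.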
[0038] At retrieval time, as shown in drawing 3, the search is performed with reference to the
DNA sequence character component table that has been created. First, input 400 of the retrieval
base sequence and of the error permission rate m for the retrieval is performed. The error
permission rate m is set according to the precision of the input retrieval base sequence and of
the base sequences in the database. It is known that a determined base sequence differs from the
actual base sequence because of reading errors in the experimental data at the time of base
sequence determination. The precision of a base sequence is determined by the degree of this
difference. Therefore, it suffices to obtain the precision information of the base sequences by
experiment beforehand and to use it to determine the error permission rate for retrieval.
Although the precision of a base sequence depends on the method of the base sequence
determination experiment and other factors, a value of about 5 to 10% or less may be set as the
error permission rate for judging identity. Next, extraction 401 of array components from the
retrieval array is performed. For a retrieval array of length Nk, array components of k bases
(6 bases in the drawing) are extracted starting from one end and shifting k bases at a time,
without overlap and without gaps, for as long as a component of k bases can be obtained before
the other end. The extracted array components are numbered in order of extraction (from i=1 to
i=Ne). Next, retrieval 402 against the already registered DNA sequence component tables is
performed using the extracted array components. This retrieval is performed as follows. As shown
in drawing 3, let fi be the value of the entry in a DNA sequence component table corresponding to
the i-th array component extracted from the retrieval array, and let S be the sum of fi over all
extracted array components, from i=1 to i=Ne. The retrieval hit condition is that S is greater
than or equal to Ne-m·Nk. For a fixed number of retrieval errors, the number of array components
for which fi becomes 0 is largest when the errors are distributed one per component. Consequently,
when the number of errors is within the error permission rate, that is, at most m·Nk, the number
of array components with fi=0 is at most m·Nk. Therefore, if the value obtained by subtracting
m·Nk from the number Ne of array components is set as the threshold for a retrieval hit, every
array whose errors are within the error permission rate satisfies the hit condition, and the
search is performed without omissions.
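The hit condition S >= Ne-m·Nk can be illustrated with a short sketch. The helper names, the base-4 entry numbering, and the sample sequences are assumptions made for the example; only the extraction rule and the threshold come from the text.

```python
BASES = "ACGT"  # assumed entry ordering: A=0, C=1, G=2, T=3


def component_index(component):
    """Entry number of a component, read as a base-4 number."""
    index = 0
    for base in component:
        index = index * 4 + BASES.index(base)
    return index


def build_component_table(sequence, k):
    """1-bit table of all k-base components found by shifting one base at a time."""
    table = [0] * (4 ** k)
    for start in range(len(sequence) - k + 1):
        table[component_index(sequence[start:start + k])] = 1
    return table


def query_components(query, k):
    """Components of k bases taken k bases apart: no overlap, no gap."""
    return [query[s:s + k] for s in range(0, len(query) - k + 1, k)]


def table_hit(table, query, k, m):
    """True when S, the number of query components found in the table, is >= Ne - m*Nk."""
    components = query_components(query, k)
    s = sum(table[component_index(c)] for c in components)
    return s >= len(components) - m * len(query)


database_sequence = "ACGTTGCAAGCTAGGCTTACCGATCGATTGCCAATGGTCCAGTTAGCGCATACGGATCTG"
table = build_component_table(database_sequence, k=6)
# Retrieval array: the same 60 bases with two substitutions (at positions 7 and 30).
query = database_sequence[:7] + "T" + database_sequence[8:30] + "G" + database_sequence[31:]
print(table_hit(table, query, k=6, m=0.05))
```

With Nk=60 and m=5%, up to three errors are permitted; two substitutions can spoil at most two of the ten components, so the sequence still passes the table search and goes on to the array search.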

[0039] Next, retrieval 403 on the arrays themselves is performed for the base sequences found by
the DNA sequence component table search, and the retrieval result 404 is output. Here, it is
appropriate to use score calculation between arrays by the Smith-Waterman method, which is based
on dynamic programming. The Smith-Waterman method assigns suitable score values to deletion,
insertion, substitution, and match of array characters, aligns the two arrays, and finds the
alignment for which the total score is the maximum. By using the score of such an alignment as an
index of the similarity between two arrays, base sequences within the error permission rate can
be retrieved accurately. Thus, in this example, a search by the base sequence component table,
based on the degree of coincidence between the array components of the retrieval base sequence
and those of the base sequences in the database, is performed first, and the many unrelated base
sequences are screened out while a fixed error is permitted. Then only the narrowed-down base
sequences are searched by the Smith-Waterman method, which takes time but allows exact retrieval.
In this way, retrieval that is fast, permits errors, and has no omissions can be realized. To
estimate how much speed-up is possible, the narrowing-down rate of the DNA sequence component
table search, which is the major factor determining the retrieval speed, is estimated below.
When a database that contains many identical base sequences is searched, the narrowing-down rate
depends on the number of identical base sequences in the database. Therefore, the case considered
here is one in which the database consists of mutually unrelated base sequences and no base
sequence that should hit the retrieval base sequence exists in the database. This makes it
possible to evaluate the probability that an unrelated base sequence hits by chance, that is,
that retrieval noise arises in the search by the base sequence component table. The following
system is considered as a model.

[0040] (1) Each base sequence in the database is a random array of fixed length Nd.

[0041] (2) The retrieval array is a random array of fixed length Nk.
[0042] The narrowing-down rate RS in this case is calculated as follows. The number of array
component kinds for which '1' is set in the base sequence component table of one array is largest
when there is no duplication among the array components in that array, and this maximum is given
by the number Np of array components extracted from the base sequence. Since Np=Nd-k+1, Np=245
when Nd=250 (the base sequence length usually obtained by base sequence determination is in many
cases 250 or more) and k=6. On the other hand, the total number Na of array component kinds in a
base sequence component table is 4 to the k-th power, so Na=4096 for k=6. Therefore, the
probability P that one array component extracted from a random retrieval base sequence hits a '1'
in an array component table by chance is at most Np/Na, and in this case
P<=Np/Na=245/4096≈0.06. When Np<<Na, it is considered that there is almost no duplication among
the array components in a base sequence, so P≈Np/Na=0.06 may be assumed. The narrowing-down rate
RS is given as the probability that, among the array components (number Ne) extracted from the
retrieval base sequence, at least Ne-m·Nk components, which is the retrieval threshold, hit a '1'
in an array component table by chance. Since this is the probability that an event of probability
P occurs at least Ne-m·Nk times in Ne trials, it can be expressed as the following sum of
binomial probabilities.

[0043]
[Equation 1]

$$R_S = \sum_{i = N_e - m N_k}^{N_e} \binom{N_e}{i} P^{i} (1 - P)^{N_e - i} \qquad (1)$$

[0044] Since the number Ne of array components extracted from a retrieval base sequence is given
by the quotient of Nk divided by k, Ne=250/6=41 for Nk=250 and k=6. With the error permission
rate m set to 10%, substituting P=0.06, Ne=41, Nk=250, and k=6 into (Equation 1) gives
RS≈6.5x10^-10. The retrieval time in this case is estimated as follows. As shown in (Equation 2),
the retrieval time tdp by the Smith-Waterman method is proportional to the product of Nk, Nd, and
the number N of base sequences in the database, with tdp0 as the proportionality constant.

[0045]
[Equation 2]

$$t_{dp} = t_{dp0} \cdot N \cdot N_k \cdot N_d \qquad (2)$$
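The narrowing-down rate quoted above can be checked numerically. The sketch below assumes that the sum is a binomial tail (the probability of at least Ne-m·Nk chance hits in Ne independent trials), which is how the preceding paragraph describes the event; the function name and the rounding of a non-integer threshold up to the next integer are assumptions of this example.

```python
import math


def narrowing_rate(p, ne, nk, m):
    """Probability that at least Ne - m*Nk of Ne components hit by chance."""
    # Round a fractional threshold up; the small epsilon guards against float noise.
    threshold = max(0, math.ceil(ne - m * nk - 1e-9))
    return sum(
        math.comb(ne, i) * p ** i * (1 - p) ** (ne - i)
        for i in range(threshold, ne + 1)
    )


# Values from the text: Np/Na = 245/4096, Ne = 41, Nk = 250, m = 10%.
rs = narrowing_rate(245 / 4096, ne=41, nk=250, m=0.10)
print(f"{rs:.2e}")
```

With these inputs the threshold is 16 hits out of 41 components, and the result is of the same order as the value of about 6.5x10^-10 given in the text.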

[0046] On the other hand, the retrieval time t of this approach can be expressed as the sum of
the retrieval time ttb by the array component table and the retrieval time tdp' by the
Smith-Waterman method for the narrowed-down arrays. Considering that ttb is proportional to Nk/k
and N with t0 as the proportionality constant, and that tdp' is the product of tdp and the
narrowing-down rate RS, t can be expressed by (Equation 3).
[0047]
[Equation 3]

$$t = t_0 \cdot N \cdot \frac{N_k}{k} + R_S \cdot t_{dp0} \cdot N \cdot N_k \cdot N_d \qquad (3)$$

[0048] Here, assuming that tdp0≈t0, when RS is smaller than 1/(k·Nd), t can be approximated as
t≈t0·N·Nk/k. In the situation considered here, 1/(k·Nd)=1/(6·250)≈0.001 and RS≈6.5x10^-10, so
this approximation holds, and the ratio of t to tdp can be expressed by (Equation 4).
[0049]
[Equation 4]

$$\frac{t}{t_{dp}} = \frac{t_0 \cdot N \cdot N_k}{k \cdot t_{dp0} \cdot N \cdot N_k \cdot N_d} \approx \frac{1}{k \cdot N_d} \approx 0.001 \qquad (4)$$



[0050] Thus, with this approach, a database can be searched in about 1/1000 of the retrieval
time of the Smith-Waterman method. This speed-up depends mainly on the narrowing down by the
array component table search. If the narrowing-down rate becomes larger than 1/(k·Nd), then
t/tdp≈RS, and the retrieval time increases in proportion to the narrowing-down rate.
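The two regimes described here follow from dividing the two-stage retrieval time by the Smith-Waterman time, which gives t/tdp = t0/(tdp0·k·Nd) + RS. A small sketch, with a hypothetical function name and with t0 = tdp0 assumed by default as in the text:

```python
def time_ratio(rs, k, nd, t0=1.0, tdp0=1.0):
    """t / tdp for the two-stage search: table-search term plus narrowing-down rate."""
    table_term = t0 / (tdp0 * k * nd)  # cost of the component table search
    return table_term + rs             # plus the share of arrays passed on to stage two


print(time_ratio(6.5e-10, k=6, nd=250))  # table search dominates
print(time_ratio(0.01, k=6, nd=250))     # narrowing-down rate dominates
```

When RS is far below 1/(k·Nd) the ratio is essentially 1/(k·Nd); once RS exceeds that value, the ratio tracks RS.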

[0051] Next, drawing 4 shows the results of calculating RS while varying k from 4 to 8 for
Nkd=100 to 1000, with Nk=Nd=Nkd and m=10%. It can be seen that for each base length Nkd there
exists a value km of k that minimizes the narrowing-down rate. The values of km for Nkd=100,
Nkd=250, Nkd=500, and Nkd=1000 are k=6, 7, 7, and 8, respectively. RS becomes 0.001 or less for
k=5, 6, 7, and 8 when Nkd=100 and Nkd=250. On the other hand, RS becomes 0.001 or less for k=6,
7, and 8 when Nkd=500 and Nkd=1000. As the value of k is made larger, the array component table
search time becomes shorter in proportion, but the required amount of memory increases.
Therefore, k should be set within the above ranges according to the scale of the database.
[0052] The FASTA method also narrows down candidates by comparing partial arrays before the
search by the Smith-Waterman method. Although sufficient narrowing down is possible if the score
threshold at that stage is set high, it is known that omissions in retrieval then arise from the
narrowing down. With the present approach, a search about 1000 times faster than the
Smith-Waterman method becomes possible without such omissions in retrieval.
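For reference, the second-stage score calculation by the Smith-Waterman method can be sketched as below. The score values (match +1, mismatch -1, gap -1) and the function name are illustrative assumptions; the text only states that suitable score values are assigned to deletion, insertion, substitution, and match.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Maximum local alignment score of arrays a and b by dynamic programming."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, which is what makes the alignment local.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("ACGTACGT", "ACGAACGT"))
```

The running time is proportional to the product of the two array lengths, which is why the text applies this step only to the arrays that survive the component table search.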

[0053] (Example 2) The 2nd example of this invention is explained below using drawing 5. In
this example, the method of extracting array components from a retrieval base sequence is
generalized in order to make fuller use of the array information in the retrieval base sequence.
As shown in drawing 5, extraction 500 of array components from a retrieval array (1) extracts an
array component of k bases from one end of the retrieval array, (2) shifts the start point by ks
bases within the retrieval array, and (3) repeats the extraction of an array component of k bases
for as long as the end of the array component falls within the retrieval array. Next, retrieval
501 by array component table search is performed. The retrieval conditions are set as follows.
The retrieval hit condition is that S is greater than or equal to Ne-f(k, ks)·m·Nk. The
definitions of S, Ne, Nk, and m are the same as in the 1st example. Here, f(k, ks) is a factor
that takes into account the overlap between adjacent array components of k bases, and is
expressed by (Equation 5) as a function of k and ks.
[0054]
[Equation 5]

$$
f(k, k_s) =
\begin{cases}
1, & k \le k_s \\
2, & U(k/2) \le k_s < k \\
3, & U(k/3) \le k_s < U(k/2) \\
\;\vdots & \\
k, & 1 \le k_s < U(k/(k-1))
\end{cases}
\qquad (5)
$$

where U(R)=I (R: a positive real number; I: the smallest positive integer greater than or equal
to R).

[0055] By setting the function f(k, ks) in this way, base sequences within the error permission
rate m can be retrieved without omissions even when an array error lies in the overlap between
adjacent array components of k bases. Drawing 6 shows the results of calculating the
narrowing-down rate RS, by the same method as in the 1st example, for cases where ks is smaller
than k. The dependence of RS on the error permission rate is shown for k=6 and for each value of
ks=1 to 6. When ks=6 (that is, the same as the 1st example), RS is about 10^-9 at an error
permission rate of 10%, but RS increases rapidly as the error permission rate becomes larger; at
an error permission rate of 13% or more, RS exceeds 0.001 and the retrieval speed decreases. As
shown in drawing 6, for ks=1 and 5 the narrowing-down rate increases greatly, and no benefit is
obtained from making ks smaller than k. However, for ks=2 and 3 the narrowing-down rate
decreases at each error permission rate. For ks=2, a narrowing-down rate of 0.001 or less is
obtained up to at least m=14%. However, as ks becomes smaller, the number of array components to
be judged increases and the retrieval speed decreases by that amount, so the value of ks should
be determined in balance with the improvement in the narrowing-down rate. Moreover, if a ks
larger than k is used, the value of Ne in the retrieval hit condition becomes smaller and the
narrowing-down rate becomes larger, but the component table search becomes faster. Therefore,
when a small error permission rate (about 5% or less) can be used, the increase in the
narrowing-down rate is also small, so the overall retrieval speed can be raised in this case.
Thus, in this example, by choosing a suitable value of ks according to the error permission rate,
the overall retrieval speed can be raised either by improving the narrowing-down rate over the
1st example or by speeding up the component table search.
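The generalized extraction and the factor f(k, ks) can be sketched as follows. The closed form ceil(k/ks) used here is an assumed compact reading of the case table of Equation 5 (it reproduces f=1, 2, 3, and k for the listed ranges), and the function names are hypothetical.

```python
def overlap_factor(k, ks):
    """f(k, ks): how many extracted components a single erroneous base can fall in."""
    if ks >= k:
        return 1
    return -(-k // ks)  # integer ceiling of k / ks


def strided_components(query, k, ks):
    """Components of k bases whose start points are ks bases apart."""
    return [query[s:s + k] for s in range(0, len(query) - k + 1, ks)]


def hit_threshold(ne, nk, m, k, ks):
    """Right-hand side of the hit condition S >= Ne - f(k, ks) * m * Nk."""
    return ne - overlap_factor(k, ks) * m * nk


components = strided_components("ACGTACGTAC", k=6, ks=2)
print(components, overlap_factor(6, 2))
```

Because one erroneous base can spoil up to f(k, ks) overlapping components, the threshold is lowered by that factor so that no array within the error permission rate is missed.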

[0056] (Example 3) The 3rd example of this invention is explained below using drawing 7. In
this example, division 600 of the base sequences in the database, allowing overlap, is performed,
and the DNA sequence component table is created for each divided array. As shown in drawing 7,
from a base sequence of array length Nd in the database, arrays of a predetermined fixed length
Nf are extracted starting from one end, with an overlap of array length Ns between successive
arrays, and this is repeated until the other end is reached (the last array has length Nfe).
Here, Ns is set to a value larger than the array length of the retrieval base sequences to be
used. Next, creation 601 of a base sequence component table for each divided base sequence is
performed, a serial number indicating its position within the original base sequence is attached
to each divided base sequence, and registration 602 to the database is performed. The effect of
this example is explained below. When public DNA databases such as GenBank are searched by this
invention, the average array length of the base sequences in the database is generally about
1000 bases. On the other hand, a base sequence used for retrieval is about as long as an array
that a DNA sequencer can sequence in one run, which is 200 to 400 bases. Thus, when the base
sequences in the database are considerably longer than the retrieval base sequence, the number
of '1' entries in a base sequence component table increases, and the narrowing-down rate may
increase. Drawing 8 shows the results of calculating, by the same calculation method as in the
1st example, the dependence of the narrowing-down rate on the length of the base sequences in
the database. A retrieval base sequence of 250 bases was considered, and the base sequence
length Nd in the database was varied from 250 to 1500. As shown in the drawing, the
narrowing-down rate is about 0.001 or less for Nd of 750 or less, but for Nd of 1000 or more the
narrowing-down rate becomes 0.01 or more and the retrieval time increases. Therefore, if the
base sequence component table is created for base sequences that have been divided and shortened
beforehand, as in this example, the narrowing-down rate for the divided base sequences can be
kept small. Although the retrieval time increases in proportion to the number of divisions, this
increase is very small compared with the increase in retrieval time caused by the larger
narrowing-down rate when the sequences are not divided. From drawing 8, when the retrieval array
length is 250 or less, it suffices to set 500 to 750 as the division array length Nf of the base
sequences in the database and 250 to 375 as the overlap length. With these settings, the whole
retrieval array can be contained in a divided array, and in that case the narrowing-down rate
can be made 0.001 or less. When hits occur, adjacent divided arrays necessarily hit, so it
suffices to output their serial numbers within the original base sequence.
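The division with overlap can be sketched as follows. The function name and the tuple layout (serial number, start position, fragment) are assumptions for illustration; Nf and Ns correspond to the division length and overlap length in the text.

```python
def split_with_overlap(sequence, nf, ns):
    """Divide a sequence into arrays of length nf whose neighbours share ns bases."""
    if not 0 <= ns < nf:
        raise ValueError("overlap ns must satisfy 0 <= ns < nf")
    fragments = []
    start, serial = 0, 1
    while True:
        # The last fragment may be shorter than nf (length Nfe in the text).
        fragments.append((serial, start, sequence[start:start + nf]))
        if start + nf >= len(sequence):
            break
        start += nf - ns
        serial += 1
    return fragments


for serial, start, fragment in split_with_overlap("ACGT" * 275, nf=500, ns=250):
    print(serial, start, len(fragment))
```

With an overlap at least as long as the retrieval array, every window of that length lies wholly inside at least one fragment, which is the property the text relies on.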

[0057] (Example 4) The 4th example of this invention is explained below using drawing 9. In
this example, when array components are extracted from a retrieval base sequence and from the
base sequences in the database, the information at the ends of the arrays is used effectively.
As shown in drawing 9(a), in the extraction from a retrieval base sequence, when the extraction
leaves a remainder at one end of the retrieval base sequence, a base sequence in which the
beginning of the retrieval base sequence is joined to its other end is considered, and array
components are extracted including all the components that contain this junction. Likewise, as
shown in drawing 9(b), in the extraction from a base sequence in the database, when array
components are extracted while shifting one base at a time and a remainder is left at one end of
the base sequence, a base sequence in which the beginning of that base sequence is joined to its
other end is considered, and array components including all the components that contain this
junction are extracted.

[0058] By using the approach of this example, when only base sequences in the database that are
the same as the retrieval base sequence need be considered (that is, when the retrieval base
sequence is not partially contained in a base sequence in the database), the information at the
ends of the base sequences, which was not used by the approaches of the 1st and 2nd examples,
can be used effectively. For example, for Nk=250, Nd=250, k=6, ks=6, and m=10%, the
narrowing-down rate by the approach of the 1st example is RS≈6.5x10^-10, as shown in the 1st
example. On the other hand, with this example, the probability P that one component hits is
P=250/4096=0.061, the number of components extracted from the retrieval base sequence is 42, and
the number of components required for a hit is 17, so calculation using (Equation 1) gives
RS≈1.4x10^-10. Thus, the narrowing-down rate is improved to about 1/5 of that of the approach of
the 1st example.
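The extraction across the joined ends can be sketched as below. Treating the array as circular by appending its first k-1 bases is an implementation choice assumed for this example, and the function names are hypothetical.

```python
def circular_query_components(query, k):
    """Query components taken k bases apart, wrapping past the end to the beginning."""
    extended = query + query[:k - 1]
    return [extended[s:s + k] for s in range(0, len(query), k)]


def circular_database_components(sequence, k):
    """Database components taken one base apart, including those spanning the junction."""
    extended = sequence + sequence[:k - 1]
    return [extended[s:s + k] for s in range(len(sequence))]


sequence = "ACGT" * 62 + "AC"  # 250 bases
print(len(circular_query_components(sequence, 6)))
print(len(circular_database_components(sequence, 6)))
```

For a 250-base array and k=6 this yields 42 query components and 250 database components, matching the counts used in the calculation above.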
[0059] (Example 5) The 5th example of this invention is explained below using drawing 10. In
this example, the capacity of an array component table is reduced by a hashing technique that
uses frequency information. To create the hashing-type array component table using frequency
information, the frequency of use of the array components in the base sequences registered in
the database is investigated, and the hash function is determined from the frequency
information. For components with high frequency, the number of components assigned to the same
entry is made small, and for components with low frequency, the number of components assigned to
the same entry is made large. Specifically, as shown in drawing 10, the frequency distribution
700 of each array component kind in the database is investigated, and the frequency distribution
701, in which the array component kinds are rearranged in order of frequency, is created. Then,
as shown in hashing method 702, a component kind with low frequency is associated with a
component kind with high frequency, for example as the arrows in the drawing show, and both are
given the same entry number. As a result, the frequency of each entry becomes approximately
constant and a constant narrowing-down rate is always obtained, so stable retrieval time is
guaranteed. Moreover, reducing the capacity of the array component table by hashing has the
following two effects. First, more base sequences can be handled in array component tables for
the same value of k. Second, when the same number of base sequences is handled in array
component tables, a larger value of k can be used. A larger k shortens the retrieval time of the
array component table search and realizes faster retrieval.
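One way to realize the frequency-based hashing is sketched below. The text only states that low-frequency component kinds are associated with high-frequency kinds so that entry frequencies become roughly constant; the specific pairing used here (the kind ranked i-th shares an entry with the kind ranked i-th from the bottom) and the function name are assumptions of this example.

```python
def frequency_folded_hash(frequencies):
    """Assign entry numbers so that a rare kind shares an entry with a frequent kind."""
    # Rank component kinds from most to least frequent (ties broken by name).
    ranked = sorted(frequencies, key=lambda kind: (-frequencies[kind], kind))
    last = len(ranked) - 1
    return {kind: min(rank, last - rank) for rank, kind in enumerate(ranked)}


counts = {"AAAAAA": 9, "CCCCCC": 7, "GGGGGG": 3, "TTTTTT": 1}
entries = frequency_folded_hash(counts)
print(entries)
```

In the usage example the four kinds fold into two entries whose combined frequencies are equal (9+1 and 7+3), so the table is halved while the load per entry stays even.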

[0060] One example of hashing is the following. In base sequence databases, 11 kinds of base
characters are used in addition to A, C, G, and T. These are used when a base cannot be
determined as A, C, G, or T at the time of base sequence determination, and they are
distinguished according to the degree of indefiniteness. In usual base sequence determination,
the frequency of occurrence of these base characters is very small, about 1/100 of the frequency
of occurrence of A, C, G, and T. Therefore, when base characters other than A, C, G, and T
appear in an array component, if array components in which they are converted into A, C, G, or T
are created and the array component table is created from them, the capacity of the array
component table can be made markedly smaller with almost no increase in the narrowing-down rate.
At retrieval time, the array components extracted from the retrieval base sequence are converted
in the same way before the array component table is searched.

[0061] (Example 6) The 6th example of this invention is explained using drawing 11. Here, the
method of creating two or more subsets, each consisting of array components of a predetermined
length k, from a retrieval array is generalized. In this example, extraction 801 of array
components from a retrieval array is performed as shown in drawing 12. That is, a plurality
(denoted n) of subsets whose elements, array components of length k, differ from one another are
created by the following procedure.
[0062] (1) Extract an array component of k bases from one end of the retrieval array.
[0063] (2) Shift the position at which extraction of the array component of k bases started by
ks bases within the retrieval array, and extract an array component of k bases.
[0064] (3) Repeat the operation of (2) for as long as the end of the array component is
contained within the retrieval array, thereby creating one subset.

[0065] (4) Shift by kn bases, within the retrieval array, from the position at which extraction
of the array component of k bases started, repeat the operations (1) to (3), and create a new
subset.

[0066] (5) When creating the 3rd and subsequent subsets, shift by kn bases from the position in
the retrieval array at which extraction of the array component of k bases started in the subset
created immediately before, and repeat the series of operations (1) to (3).
[0067] These operations are performed the specified number n of times, and n subsets consisting
of different array components are created. Furthermore, the search method that uses the created
subsets is generalized. The upper limit of the value of n (the number of subsets) is set as
follows. Creation of subsets is stopped when the array components of base length k that are the
elements of a subset all come to be contained among the elements of a subset already created.
That is, when the extraction start position of the array components of the n-th subset, which is
shifted kn x (n-1) characters from one end of the retrieval array, is the (k x d - kn)-th
character (d is a positive integer), the next subset would be contained in the first subset, so
creation of subsets is stopped at this point. In other words, the number of subsets created is
the minimum n for which kn x n is a multiple of k. For example, three subsets are created when k
is six characters and kn is four characters. If kn is one character, k subsets are created; when
k is six characters, six subsets are created. Thus, extraction 801 of array components is
performed n times.
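The procedure (1) to (5) and the stopping rule can be sketched as follows. The closed form k / gcd(k, kn) for the smallest n with kn x n a multiple of k is a standard identity used here for convenience, not a formula stated in the text; the function names and the default ks=k are assumptions of this example.

```python
from math import gcd


def number_of_subsets(k, kn):
    """Smallest n for which kn * n is a multiple of k."""
    return k // gcd(k, kn)


def component_subsets(query, k, kn, ks=None):
    """Create the subsets; the j-th subset starts kn * (j - 1) bases from the end."""
    ks = k if ks is None else ks
    subsets = []
    for j in range(number_of_subsets(k, kn)):
        offset = j * kn
        subsets.append([query[s:s + k] for s in range(offset, len(query) - k + 1, ks)])
    return subsets


for subset in component_subsets("ACGTACGTACGTACGTACGT", k=6, kn=4):
    print(subset)
```

For k=6 and kn=4 the start offsets are 0, 4, and 8, giving the three subsets mentioned in the text; a fourth subset would start at offset 12 and repeat components of the first.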

[0068] Next, retrieval 802 using the n different subsets created is explained. The search for
each subset uses the approach described in example 2. The retrieval hit condition when n subsets
are created and searched is given by (Equation 6) below.
[0069]
[Equation 6]

$$S \ge N_e(n) - f(k, k_s) \cdot m \cdot N_k(n) \qquad (6)$$

[0070] Here, the retrieval array length Nk(n) for the n-th subset and the number Ne(n) of array
components extracted from the retrieval array in the n-th subset are defined by
Nk(n)=Nk-kn x (n-1) and Ne(n)=Nk(n)/k. Under the above retrieval hit condition, judgment 803 is
performed for every subset, only the arrays that satisfy the hit condition for all subsets are
extracted by the array component table search, array retrieval 804, which is the next-stage
retrieval, is performed on the extracted arrays, and the retrieval result 805 is obtained.
[0071] The retrieval processing time t(n) using n subsets in this example is expressed by
(Equation 7) below. Here, Rs(n) denotes the narrowing-down rate when this example is used, and
the other variables in (Equation 7) are the same as those defined in example 1.
[0072] 
[Equation 7] 

$$t(n) = n \cdot t_0 \cdot N \cdot \frac{N_k}{k} + R_s(n) \cdot t_{dp0} \cdot N \cdot N_k \cdot N_d \qquad (7)$$



[0073] In this example, since the array component table is searched for each of the n subsets
created, the time ttb of the array component table search is n times that of the approach
described in example 2. The retrieval time in this example becomes shorter than the retrieval
time in example 1, that is, t(n) becomes smaller than t, when the narrowing-down rate of the
retrieval in this example satisfies the condition given by (Equation 8) below. If the condition
on the narrowing-down rate shown by (Equation 8) is satisfied, the retrieval processing time is
shortened.
[0074] 
[Equation 8] 



$$R_s(n) \le R_S - \frac{(n-1) \cdot t_0}{k \cdot t_{dp0} \cdot N_d} \qquad (8)$$



[0075] Next, the narrowing-down rate in this example is described. In this example, the
narrowing-down rate is expressed as follows. Put simply, the narrowing-down rate obtained by
searching each of the n subsets is the same as the value expressed by (Equation 1), and the
narrowing-down rate Rs(n) of the search using n subsets is the product of the narrowing-down
rates obtained for the individual subsets. Drawing 13 shows the results of comparing the
narrowing-down rate of the retrieval using a plurality of different subsets, which is the search
method of this example, with the narrowing-down rate of the search method described in example
2. The results shown in drawing 13 were obtained under the conditions given below.

[0076] The character arrays used were array data from an actual database. The database used
was GBPRI.SEQ, which collects the gene sequences of primates in GenBank (release 74.0), a public
database. In GBPRI.SEQ, about 20x10^6 bases in total, or about 20,000 arrays in number, are
registered. From the arrays in this database, those shorter than 1000 bases were excluded,
arrays longer than 1000 bases were trimmed to 1000 bases, and 500 arrays were taken as the
object of retrieval. For this array database, array component tables were created for base
components of lengths from three to ten characters. The array component tables were created by
the approach described in example 1.
[0077] As the retrieval arrays, 400 arrays from the same GBPRI.SEQ database that do not overlap
the arrays in the above database were used. The narrowing-down rate was calculated for each
array, and the average was taken.
[0078] At retrieval time, array components were extracted from the retrieval array under the
following conditions. The array length Nd of the database was 1000 bases, and the retrieval
array length was 100 bases. The variables for extracting array components of fixed length from a
retrieval array were set as follows. In both the approach described in example 2 and this
example, the extraction position is shifted by ks characters when array components of fixed
length k are cut out. In the actual retrieval, these values were set to k=ks. The value of kn,
which is used only in this example, was set to one character. This is because the number of
subsets created is then the largest compared with other values of kn, and it was considered that
the finest retrieval could be performed. When kn is set to one character, the number of subsets
for array components of k characters is k. Accordingly, the proportion of arrays that satisfied
the retrieval hit condition in all of the 3 to 10 subsets was calculated in this example. The
error permission rate m, which determines the retrieval hit condition, was set to 5%.
[0079] The results of the search using the above variables are shown in drawing 13. The
narrowing-down rate is improved over the approach described in example 2. When the array
component length k is 5, 6, 7, 8, or 9 characters, the narrowing-down rate is greatly improved,
and the retrieval time is shorter than with the approach described in example 2. Therefore, this
example is effective in improving the narrowing-down rate of the array component table search
and shortening the retrieval time. In the results of drawing 13, the improvement in the
narrowing-down rate by this example was not as large as predicted; this is considered to be
because real data were used for the retrieval and the database was not random, for reasons such
as the presence of many repetitive arrays.

[0080] (Example 7) Next, the 7th example of this invention is explained using drawing 11 and
drawing 14. In this example, the extract 801 of the array component from the retrieval array in
drawing 11 is performed by the approach shown in drawing 14. That is, the extract 801 of array
components is performed for i different lengths ki, and subsets consisting of array components
of differing lengths are created. The procedure is as follows. (1) For the length k1 specified
first, extract an array component of k1 base length from one end of the retrieval array.

[0081] (2) In the retrieval array, shift the array component extract start point by ks1 bases,
and extract an array component of k1 base length.

[0082] (3) Repeat procedure (2) while the end of the array component remains within the
retrieval array. By the above procedure, the subset for array components of base length k1 is
created. Next, for a base length k2 that was specified beforehand and differs from k1,
procedures (1) to (3) are performed, and the subset consisting of array components of base
length k2 is created. In this way, a subset consisting of array components is created for each
specified length.
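
Procedures (1) to (3) can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the invention: the function names `extract_components` and `subsets_by_length` are hypothetical, and the shift ks is assumed to equal k, as in the retrieval conditions reported for this example.

```python
def extract_components(query: str, k: int, ks: int, start: int = 0) -> list[str]:
    """Cut components of length k out of query, shifting the start by ks each time."""
    if k <= 0 or ks <= 0:
        raise ValueError("k and ks must be positive")
    components = []
    pos = start
    # Procedure (3): continue while the end of the component stays inside the query.
    while pos + k <= len(query):
        components.append(query[pos:pos + k])  # procedures (1) and (2)
        pos += ks
    return components


def subsets_by_length(query: str, lengths: list[int]) -> dict[int, list[str]]:
    """Create one subset per specified length (assumption: ks = k for every length)."""
    return {k: extract_components(query, k, ks=k) for k in lengths}


subsets = subsets_by_length("ACGTACGTACGT", [3, 4, 5])
print(subsets[3])  # ['ACG', 'TAC', 'GTA', 'CGT']
print(subsets[5])  # ['ACGTA', 'CGTAC']
```

Each key of the returned dictionary is one array component length, so a group of lengths such as 3, 4 and 5 yields three subsets.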

[0083] The search method for the created subsets follows the procedure already explained in
example 2. However, since the base length of the array components that form the elements differs
from subset to subset, an array component table must also be created beforehand for each base
length. That is, when the array components of the subsets have base lengths k1, k2, and k3,
respectively, array component tables must be created beforehand for array components of k1, k2,
and k3 base length, respectively. The creation approach of the array component table for each
base length is as described in example 1. Moreover, in performing an array component table
search, the array component table corresponding to the length of the array components of each
subset is searched. That is, when performing an array component table search with the subset
consisting of array components of base length k1, the array component table of base length k1
is used.
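
The pairing of each subset with the table of the same base length can be sketched as follows. The sketch rests on assumptions not confirmed by this section: the table is modelled as a dictionary from each distinct array component to the identifiers of the registered arrays containing it, the table is built from every position of each registered array, and the score is taken to be the number of subset components found in an array. All function names are hypothetical.

```python
from collections import defaultdict


def build_component_table(database: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map each distinct component of length k to the ids of arrays containing it."""
    table: dict[str, set[str]] = defaultdict(set)
    for array_id, sequence in database.items():
        for pos in range(len(sequence) - k + 1):
            table[sequence[pos:pos + k]].add(array_id)
    return dict(table)


def build_tables(database: dict[str, str], lengths: list[int]) -> dict[int, dict[str, set[str]]]:
    """Create one component table per base length, prepared before retrieval."""
    return {k: build_component_table(database, k) for k in lengths}


def score_subset(subset: list[str], table: dict[str, set[str]]) -> dict[str, int]:
    """Count, per registered array, how many components of the subset it contains."""
    scores: dict[str, int] = defaultdict(int)
    for component in subset:
        for array_id in table.get(component, ()):
            scores[array_id] += 1
    return dict(scores)


database = {"seq1": "ACGTACGT", "seq2": "TTTTAGGA"}
tables = build_tables(database, [3, 4])
# A subset of 3-base components is scored against the 3-base table only.
print(score_subset(["ACG", "TAC", "TTT"], tables[3]))  # {'seq1': 2, 'seq2': 1}
```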

[0084] After the i subsets are created, the array component table search 802 is performed,
selection 803 of the arrays that fulfill the retrieval hit conditions in all cases is performed,
and array retrieval 804 is performed. Here, the retrieval hit conditions for each subset can be
expressed by substituting into (several 6) the values of the variables corresponding to the
array component length ki. Likewise, as in example 6, the retrieval time is obtained by
substituting the value corresponding to each variable into (several 7).

[0085] The retrieval result of this example is shown in drawing 15. The following conditions
were used at the time of retrieval. First, as in example 6, the database was GBPRI.SEQ, which
collects the base sequences of primates in GenBank, a public database, and the array length Nd
was fixed at 1000 characters. The retrieval array length Nk was set to 100 characters, and the
retrieval arrays were chosen so as not to overlap the arrays in the database. As lengths of
array components, groups of three consecutive values from three characters to ten characters
were used. That is, the group of 3, 4 and 5 characters, the group of 4, 5 and 6 characters, and
so on up to the group of 8, 9 and 10 characters were used as array component lengths. The
number ks of characters shifted when extracting array components was set equal to the length of
the array component in each case, that is, k=ks. The rate of error permission was set to 10%.
The results of retrieval under the above conditions are shown in drawing 15, in comparison with
the results of the approach described in example 2 under the same conditions.
[0086] From the result shown in drawing 15, the rate of narrowing down is greatly improved for
the groups of array components of 5, 6 and 7 characters, of 6, 7 and 8 characters, of 7, 8 and
9 characters, and of 8, 9 and 10 characters. In retrieval in this example, as many subsets are
created as there are specified array component lengths, that is, three subsets. From these
values, (several 8) is used to calculate whether the rate of narrowing down of this example
shortens retrieval time. For the four groups mentioned above, the rate of narrowing down of
this example fulfills the conditions expressed by (several 8), so retrieval time is shortened.
Therefore, this example is effective for improving the rate of narrowing down and shortening
retrieval time.

[0087] (Example 8) The 8th example of this invention is explained below using drawing 11 and
drawing 16. In this example, the extract 801 of the array component from the retrieval array in
drawing 11 is performed by combining the approaches explained in example 6 and example 7. That
is, for each of two or more different lengths k, a plurality of different subsets is created
based on the approach of example 6. In the extract 801 of array components in this example, the
following procedure is carried out for each of the specified different lengths k, as shown in
drawing 16. First, for one length k1: (1) Cut out an array component of length k1 from one end
of the retrieval character array.

[0088] (2) Shift by ks characters from the location where the array component was cut out, and
cut out an array component of length k1.

[0089] (3) While the end of the array component is contained in the retrieval character array,
repeat procedures (1) and (2).

[0090] (4) Further, shift the cutting start position of the array component in the retrieval
character array by kn characters, and then repeat procedures (1) to (3).

[0091] (5) Repeat procedures (1) to (4) until the newly extracted array components come to be
in agreement with the array components of a subset already created.
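
Procedures (1) to (5) for a single length can be sketched as follows. This is a sketch under stated assumptions rather than the procedure of drawing 16 itself: the function name `offset_subsets` is hypothetical, and "in agreement with a subset already created" is read here as "every cutting position of the new subset already occurs in one earlier subset".

```python
def offset_subsets(query: str, k: int, ks: int, kn: int) -> list[list[str]]:
    """Create subsets of length-k components, moving the start by kn per subset."""
    if k <= 0 or ks <= 0 or kn <= 0:
        raise ValueError("k, ks and kn must be positive")
    seen_positions: list[set[int]] = []
    subsets: list[list[str]] = []
    start = 0
    while True:
        # Procedures (1) to (3): cutting positions start, start + ks, ... that
        # keep the end of the component inside the retrieval character array.
        positions = set(range(start, len(query) - k + 1, ks))
        # Procedure (5): stop when the new components coincide with those of a
        # subset already created (assumed reading), or when nothing can be cut.
        if not positions or any(positions <= earlier for earlier in seen_positions):
            break
        seen_positions.append(positions)
        subsets.append([query[p:p + k] for p in sorted(positions)])
        start += kn  # procedure (4)
    return subsets


result = offset_subsets("ACGTACGTACGT", k=3, ks=3, kn=1)
print(len(result))  # 3
print(result[1])    # ['CGT', 'ACG', 'TAC']
```

With ks = k and kn = 1, the loop produces k subsets, one per start offset 0 to k-1.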

[0092] According to the above procedure, two or more subsets of array components of k1 base
length are created. In this example, procedures (1) to (5) are further repeated for array
components of a different specified length k2, and two or more subsets are created. This
procedure is carried out for each of the i specified base lengths. In this way, a plurality of
differing subsets consisting of array components is created. The number of subsets is
determined as follows. First, let there be i specified different base lengths, and denote them
k1, ..., ki. When the subsets of array components corresponding to each base length are
created, the extract starting position of the array components is shifted by kni characters for
each subset. The number N(ki, kni) of subsets for each base length is defined as the minimum
value for which kni x N(ki, kni) is a multiple of ki. For example, when ki is six characters
and kni is two characters, N(ki, kni) is 3. The total number Gn of subsets is given by
(several 9).
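
The counting rule can be checked with a short Python sketch. The smallest N for which kni x N is a multiple of ki equals ki divided by the greatest common divisor of ki and kni. The function names are hypothetical, and the total Gn is assumed here to be the sum of N(ki, kni) over the specified base lengths.

```python
from math import gcd


def subsets_per_length(k: int, kn: int) -> int:
    """Smallest N >= 1 such that kn * N is a multiple of k."""
    if k <= 0 or kn <= 0:
        raise ValueError("k and kn must be positive")
    return k // gcd(k, kn)


def total_subsets(params: list[tuple[int, int]]) -> int:
    """Assumed form of Gn: the sum of N(ki, kni) over all (ki, kni) pairs."""
    return sum(subsets_per_length(k, kn) for k, kn in params)


print(subsets_per_length(6, 2))                 # 3, the example given in the text
print(total_subsets([(3, 1), (4, 1), (5, 1)]))  # 12 for the group 3, 4, 5 with kn = 1
```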



[0094] In this way, retrieval 803 referring to the array component table corresponding to each
base length ki is performed for the Gn subsets created. At the time of retrieval, each of the
N(ki, kni) subsets for each base length ki is searched based on its corresponding retrieval hit
conditions. When, for each base length ki and in all cases of each subset, the score S obtained
is larger than or equal to the value expressed by (several 6), retrieval 804 referring to the
character arrays is performed, and the retrieval result 805 is obtained.
[0095] The rate of narrowing down in this example is shown in drawing 17. In this example too,
the same database as was used for retrieval in example 6 and example 7 is used. The variable at
the time of array component extract was set to ks=k, and the variable at the time of creating
each subset was set to kn=1. Furthermore, as in example 7, groups of three consecutive lengths
from three characters to ten characters were used as the lengths of the array components. As in
examples 6 and 7, the array length Nd in the database was set to 1000 bases and the retrieval
array length Nk was set to 100 bases.

[0096] The results of retrieval under the above conditions are shown in drawing 17. The rate of
narrowing down in this example is the product of the rates of narrowing down of the search
methods described in example 6 and example 7. Drawing 17 shows that the rate of narrowing down
in this example is improved over the approach given in example 7. The retrieval time in this
example can be expressed by (several 7): the number Gn of subsets obtained from (several 9) is
substituted into (several 7) to calculate the retrieval time. Similarly, the condition on the
rate of narrowing down for shortening retrieval time is obtained by substituting the number Gn
of subsets into (several 8). The results examined in this way show that, for every group of
array components searched, retrieval time is shorter than with the approach given in example 2.
Moreover, as shown in drawing 17, even in comparison with the approach given in example 7, a
remarkable improvement of the rate of narrowing down is found for the groups of array
components of 3, 4 and 5 characters and of 4, 5 and 6 characters. This example is therefore
effective for improving the rate of narrowing down and shortening retrieval time.

[0097] (Example 9) The 9th example of this invention is explained below using drawing 11. In
this example, the judgment 803 of the retrieval conditions in each subset in drawing 11 is
generalized. That is, for the two or more subsets created by the extract 801 of array
components from a retrieval array, retrieval 802 using an array component table is performed,
judgment 803 is performed using the retrieval result obtained for each subset, and array
retrieval 804 is performed on the selected arrays.

[0093]
[Equation 9]
$$G_n = \sum_{m=1}^{i} N(k_m, kn_m)$$

[0098] In searching using two or more subsets, the retrieval hit conditions for each subset are
taken into account, and the arrays whose computed score fulfills the retrieval hit conditions
in all subsets are extracted. This search method is generalized as follows. First, as examples
6, 7, and 8 explained, two or more subsets with differing elements are created from one
retrieval array. The number of array components that form the elements of each subset may
differ according to the creation approach of the subset. Retrieval hit conditions are then set
for each subset. In retrieval, the arrays that fulfill the retrieval hit conditions in all
subsets are extracted. Here, for each subset, a bit flag of 1 is given if the score fulfills
the retrieval hit conditions of that subset expressed by (several 6), and a bit flag of 0 is
given if it does not. There are as many flags as there are subsets. The AND of these flags is
calculated, and the arrays for which the result is 1 are extracted in the array component table
search.
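
The flag-and-AND judgment can be sketched as follows. The thresholds stand in for the values given by (several 6), which is not reproduced in this part of the text, so they are passed in as plain numbers. The function name `select_arrays` and the data layout (one score dictionary per subset) are assumptions made for illustration.

```python
def select_arrays(scores_per_subset: list[dict[str, int]],
                  thresholds: list[int],
                  array_ids: list[str]) -> list[str]:
    """Return the arrays whose score meets the hit condition in every subset."""
    if len(scores_per_subset) != len(thresholds):
        raise ValueError("one threshold is required per subset")
    selected = []
    for array_id in array_ids:
        # One bit flag per subset: 1 if the hit condition is fulfilled, else 0.
        flags = [1 if scores.get(array_id, 0) >= threshold else 0
                 for scores, threshold in zip(scores_per_subset, thresholds)]
        # The AND of all flags is 1 only when every flag is 1.
        if all(flags):
            selected.append(array_id)
    return selected


scores = [{"seq1": 4, "seq2": 1}, {"seq1": 3, "seq2": 3}]
print(select_arrays(scores, thresholds=[3, 2], array_ids=["seq1", "seq2"]))  # ['seq1']
```

Here seq2 passes the second subset but fails the first, so its flags are 0 and 1 and it is dropped before the array retrieval 804.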

[0099] Although the character-array search method of this invention was explained in each of
the above examples by taking retrieval of a DNA sequence as the example, it goes without saying
that this invention is applicable not only to this but also to retrieval of RNA base sequences,
retrieval of amino acid sequences, and retrieval of more general documents. Moreover, in
document retrieval, it can also be applied when symbols, pictorial symbols, and the like are
included.



DESCRIPTION OF DRAWINGS 



[Brief Description of the Drawings] 

[Drawing 1] Drawing showing the configuration of the character-array retrieval equipment which
is the 1st example of this invention and to which the character-array search method is applied.

[Drawing 2] Drawing showing the contents of processing for registering a character array
(example of a DNA sequence) and for creating and registering a character-array component table
in the hierarchical pre-search in the 1st example of this invention.

[Drawing 3] Drawing showing the search method for a character array (example of a DNA sequence)
in the hierarchical pre-search in the 1st example of this invention.

[Drawing 4] Drawing showing the dependency on the character-array component length k, for each
base sequence length, of the rate of narrowing down (RS) and of the calculated value in the 1st
example of this invention.

[Drawing 5] Drawing showing the extract approach of the array component from a retrieval array,
generalized to the duplication division method, and the search method, which are the 2nd
example of this invention.

[Drawing 6] Drawing showing the dependency on the rate m of error permission of the rate of
narrowing down (RS) and of the calculated value when the character-array component length is
k=6, in the 2nd example of this invention.

[Drawing 7] Drawing showing the creation and registration approach of the duplication division
base sequence component table which is the 3rd example of this invention, in which the base
sequences in a database are divided with duplication allowed and a base sequence component
table is created for the divided arrays.

[Drawing 8] Drawing showing the dependency on the base sequence length Nd in the database of
the rate of narrowing down (RS) and of the calculated value.

[Drawing 9] Drawing showing the extract approach of a base sequence component using the
information on the array end, which is the 4th example of this invention.

[Drawing 10] Drawing showing the creation approach of a hashing-type array component table
using frequency information, which is the 5th example of this invention.
[Drawing 11] Drawing showing the search method which uses two or more subsets from the
retrieval character array, which is the 6th example of this invention.

[Drawing 12] Drawing showing the creation approach of two or more subsets from the retrieval
character array, which is the 6th example of this invention.

[Drawing 13] Drawing showing the dependency on the character-array component length k of the
rate of narrowing down (RS) in the 6th example of this invention.

[Drawing 14] Drawing showing the creation approach of two or more subsets from the retrieval
character array, which is the 7th example of this invention.

[Drawing 15] Drawing showing the dependency on the character-array component length k of the
rate of narrowing down (RS) in the 7th example of this invention.

[Drawing 16] Drawing showing the creation approach of two or more subsets from the retrieval
character array, which is the 8th example of this invention.

[Drawing 17] Drawing showing the dependency on the character-array component length k of the
rate of narrowing down (RS) in the 8th example of this invention.

[Description of Notations]

100: Display
101: Keyboard
102: Central control unit (CPU)
103: Character array
104: Character-array component table
105: Floppy disk drive
106: File for storing character arrays
107: Floppy disk
200: Main memory
201: Character-array registration program
202: Character-array component table creation and registration program
203: Error permission character-array component table search program
204: Error permission character-array search program
205: Data area
206: Hierarchy retrieval control program
300: DNA sequence registration process
301: Extract process of the character-array component from a DNA sequence
302: DNA sequence component table creation process
303: DNA sequence component table registration process
400: Input process of the rate m of error permission and a retrieval DNA sequence
401: Extract process of the array component from a retrieval DNA sequence
402: Retrieval process by the DNA sequence component table
403: Retrieval process of a DNA sequence
404: Retrieval result output process
500: Extract process of the array component from a retrieval DNA sequence by the duplication
division method
501: Retrieval process by the DNA sequence component table
600: Duplication division process of a database base sequence
601: Creation process of a duplication division base sequence component table
602: Registration process of a duplication division base sequence component table
700: Frequency distribution of each array component kind in the database
701: Frequency distribution with the array component kinds rearranged in order of frequency
702: Hashing method
800: Input process of the rate m of error permission and a retrieval DNA sequence
801: Extract of the array component from a retrieval DNA sequence, and creation process of two
or more subsets whose contents differ
802: Retrieval process by the DNA sequence component table using each subset
803: Judgment process of retrieval conditions using the result of the DNA sequence component
table search
804: Retrieval process of a DNA sequence
805: Retrieval result output process
DRAWINGS

[Drawing 1]
[Drawing 9]
[Drawing 10]

CORRECTION OR AMENDMENT

[Kind of official gazette] Printing of amendment under the provisions of Article 17-2 of the
Patent Law

[Section partition] The 3rd partition of the 6th section
[Publication date] February 9, Heisei 13 (2001.2.9)

[Publication No.] JP,7-105224,A

[Date of Publication] April 21, Heisei 7 (1995.4.21)

[Annual volume number] Open patent official report 7-1053
G06F 17/30
[FI]

G06F 15/40 370 Z
[Procedure revision]

[Filing Date] March 15, Heisei 12 (2000.3.15)

[Procedure amendment 1] 

[Document to be Amended] Specification 

[Item(s) to be Amended] Claim 

[Method of Amendment] Modification 

[Proposed Amendment] 

[Claim(s)] 

[Claim 1] A character-array search method for searching for a specified retrieval character
array in a character-array database in which two or more character arrays are registered,
characterized by comprising the following steps:
(1) A step of creating a character-array component table which contains, without duplication,
the contiguous partial character arrays of a predetermined length (referred to as k) contained
in said registered character arrays, and which expresses the information about these partial
character arrays.
(2) A step of registering said registered character arrays together with said character-array
component table in the character-array database.
(3) A step of extracting, by a predetermined approach, a subset of the character arrays of said
predetermined length (k) contained in said retrieval character array.
(4) A step of searching the character-array component table, in which said character-array
component table is referred to for said registered character arrays containing more of the
character arrays in said subset than a fixed number defined by a predetermined rate of error
permission, in order to extract the character arrays at or below said rate of error permission.
(5) A step of searching the character arrays, in which said registered character arrays
obtained by the step of searching said character-array component table are referred to, in
order to extract said registered character arrays at or below said rate of error permission.

[Claim 2] A character-array search method for searching for a specified retrieval character
array in a character-array database in which two or more character arrays are registered,
characterized by comprising the following steps:
(1) A step of assuming an annular registered character array formed by connecting both ends of
said registered character array, and creating a character-array component table which
contains, without duplication, the partial character arrays of a predetermined length (referred
to as k) contained in said annular registered character array.
(2) A step of registering said registered character arrays together with said character-array
component table in the character-array database.
(3) A step of assuming an annular retrieval character array formed by connecting both ends of
said retrieval character array, and extracting, by a predetermined approach, a subset of the
character arrays of said predetermined length (k) contained in said annular retrieval character
array.
(4) A step of searching the character-array component table, in which said character-array
component table is referred to for said registered character arrays containing more of the
character arrays in said subset than a fixed number defined by a predetermined rate of error
permission, in order to extract the character arrays at or below said rate of error permission.
(5) A step of searching the character arrays, in which said registered character arrays
obtained by the step of searching said character-array component table are referred to, in
order to extract said registered character arrays at or below said rate of error permission.

[Claim 3] The character-array search method according to claim 1 or 2, characterized in that
said character array expresses a base sequence of DNA or RNA.
[Claim 4] The character-array search method according to claim 1 or 2, characterized in that
said character array expresses an amino acid sequence.


